Abstract

The paper aims at reconsidering the famous Le Cam LAN theory. The main 
features of the approach which make it different from the classical one are: 
(1) the study is non-asymptotic, that is, the sample size is fixed and does not 
tend to infinity; (2) the parametric assumption is possibly misspecified and 
the underlying data distribution can lie beyond the given parametric family. 
These two features make it possible to bridge the gap between parametric and nonparametric
theory and to build a unified framework for statistical estimation. The
main results include a large deviation bound for the (quasi) maximum like- 
lihood and a local quadratic bracketing result for the log-likelihood process. 
The latter yields a number of important corollaries for statistical inference: 
concentration, confidence and risk bounds, expansion of the maximum like- 
lihood estimate, etc. All these corollaries are stated in a non-classical way 
admitting a model misspecification and finite samples. However, the classi- 
cal asymptotic results including the efficiency bounds can be easily derived 
as corollaries of the obtained non-asymptotic statements. At the same time, 
the new bracketing device works well in the situations with large or growing 
parameter dimension in which the classical parametric theory fails. The gen- 
eral results are illustrated for the i.i.d. set-up as well as for generalized linear 
modeling and median estimation. The results apply for any dimension of the 
parameter space and provide a quantitative lower bound on the sample size 
yielding the root-n accuracy.

1 Introduction

One of the most popular approaches in statistics is based on the parametric assumption 
(PA) that the distribution $\mathbb{P}$ of the observed data $Y$ belongs to a given parametric
family $(\mathbb{P}_\theta, \theta \in \Theta \subseteq \mathbb{R}^p)$, where $p$ stands for the number of parameters. This assumption
allows one to reduce the problem of statistical inference about $\mathbb{P}$ to recovering
the parameter $\theta$. The theory of parameter estimation and inference is nicely developed
in a quite general set-up. There is a vast literature on this issue. We only mention the 
book by Ibragimov and Khas'minskij (1981), which provides a comprehensive study of 
asymptotic properties of maximum likelihood and Bayesian estimators. The theory is 
essentially based on two major assumptions: (1) the underlying data distribution follows 
the PA; (2) the sample size or the amount of available information is large relative to the 
number of parameters. 

In many practical applications, both assumptions can be very restrictive and limit 
the scope of applicability for the whole approach. Indeed, the PA is usually only an ap- 
proximation of real data distribution and in most statistical problems it is too restrictive 
to assume that the PA is exactly fulfilled. Many modern statistical problems deal with 
very complex high dimensional data where a huge number of parameters are involved. 
In such situations, the applicability of large sample asymptotics is questionable. These 
two issues partially explain why the parametric and nonparametric theory are almost 
isolated from each other. Relaxing these restrictive assumptions can be viewed as an 
important challenge of the modern statistical theory. The present paper attempts to
develop a unified approach which does not require the restrictive parametric assumptions
but still enjoys the main benefits of the parametric theory. The main steps of the
approach are similar to the classical local asymptotic normality (LAN) theory; see e.g.
Chapters 1-3 in the monograph Ibragimov and Khas'minskij (1981): first one localizes
the problem to a neighborhood of the target parameter. Then one uses a local quadratic
expansion of the log-likelihood to solve the corresponding estimation problem. There is,
however, one feature of the proposed approach which makes it essentially different from
the classical scheme. Namely, the use of the bracketing device instead of the classical Taylor
expansion allows one to consider much larger local neighborhoods than in the LAN theory.
More specifically, the classical LAN theory effectively requires a strict localization to a
root-n vicinity of the true point. For this reason, the LAN theory fails to extend to the
nonparametric situation. Our approach works for any local vicinity of the true point.
This opens the door to building a unified theory including most of the classical parametric
and nonparametric results.

Let $Y$ stand for the available data. Everywhere below we assume that the observed
data $Y$ follow the distribution $\mathbb{P}$ on a metric space $\mathcal{Y}$. We do not specify any particular
structure of $Y$. In particular, no assumption like independence or weak dependence of
individual observations is imposed. The basic parametric assumption is that $\mathbb{P}$ can be
approximated by a parametric distribution $\mathbb{P}_\theta$ from a given parametric family
$(\mathbb{P}_\theta, \theta \in \Theta \subseteq \mathbb{R}^p)$. Our approach allows the PA to be misspecified, that is, in general, $\mathbb{P} \notin (\mathbb{P}_\theta)$.

Let $L(Y,\theta)$ be the log-likelihood for the considered parametric model:
$L(Y,\theta) = \log \frac{d\mathbb{P}_\theta}{d\mu_0}(Y)$, where $\mu_0$ is any dominating measure for the family $(\mathbb{P}_\theta)$. We focus on
the properties of the process $L(Y,\theta)$ as a function of the parameter $\theta$. Therefore, we
suppress the argument $Y$ there and write $L(\theta)$ instead of $L(Y,\theta)$. One has to keep
in mind that $L(\theta)$ is random and depends on the observed data $Y$. By
$L(\theta,\theta^*) \stackrel{\mathrm{def}}{=} L(\theta) - L(\theta^*)$ we denote the log-likelihood ratio. The classical likelihood principle suggests
estimating $\theta$ by maximizing the corresponding log-likelihood function $L(\theta)$:

$$\tilde{\theta} \stackrel{\mathrm{def}}{=} \operatorname{argmax}_{\theta \in \Theta} L(\theta). \tag{1.1}$$

Our ultimate goal is to study the properties of the quasi maximum likelihood estimator
(MLE) $\tilde{\theta}$. It turns out that such properties can be naturally described in terms of
the maximum of the process $L(\theta)$ rather than the point of maximum $\tilde{\theta}$. To avoid
technical burdens it is assumed that the maximum is attained, leading to the identity
$\max_{\theta \in \Theta} L(\theta) = L(\tilde{\theta})$. However, the point of maximum need not be unique. If there
are many such points we take $\tilde{\theta}$ as any of them. Basically, the notation $\tilde{\theta}$ is used for
the identity $L(\tilde{\theta}) = \sup_{\theta \in \Theta} L(\theta)$.

If $\mathbb{P} \notin (\mathbb{P}_\theta)$, then the (quasi) MLE $\tilde{\theta}$ from (1.1) is still meaningful and it appears
to be an estimator of the value $\theta^*$ defined by maximizing the expected value of $L(\theta)$:

$$\theta^* \stackrel{\mathrm{def}}{=} \operatorname{argmax}_{\theta \in \Theta} \mathbb{E} L(\theta), \tag{1.2}$$

which is the true value in the parametric situation and can be viewed as the parameter
of the best parametric fit in the general case.
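As a rough illustration of the target $\theta^*$ under misspecification, the following sketch fits a Gaussian location model to exponential data. The model, the sample size, and all function names are assumptions made for this example only; they do not come from the paper. For the quasi log-likelihood $L(\theta) = -\sum_i (Y_i - \theta)^2/2$, the maximizer of $\mathbb{E} L(\theta)$ is the mean of $Y_i$, so the quasi MLE (the sample mean) should be close to it.

```python
import random

def gaussian_quasi_loglik(theta, ys):
    """Quasi log-likelihood of the N(theta, 1) model, up to an additive constant."""
    return -0.5 * sum((y - theta) ** 2 for y in ys)

def quasi_mle(ys):
    """Maximizer of the Gaussian quasi log-likelihood: the sample mean."""
    return sum(ys) / len(ys)

rng = random.Random(0)
# The data are exponential with mean 2, so the Gaussian family is misspecified.
ys = [rng.expovariate(0.5) for _ in range(20000)]

theta_tilde = quasi_mle(ys)
theta_star = 2.0  # argmax of E L(theta) in this toy example: the mean of Y_i

print(theta_tilde, theta_star)
```

The printed estimate is close to 2 even though no Gaussian distribution generated the data, which is the sense in which $\theta^*$ is the parameter of the best parametric fit.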

The results below show that the main properties of the quasi MLE $\tilde{\theta}$ like concentration
or coverage probability can be described in terms of the excess, which is the difference
between the maximum of the process $L(\theta)$ and its value at the "true" point $\theta^*$:

$$L(\tilde{\theta}, \theta^*) = L(\tilde{\theta}) - L(\theta^*) = \max_{\theta \in \Theta} L(\theta) - L(\theta^*).$$

The established results can be split into two big groups. A large deviation bound states
some concentration properties of the estimator $\tilde{\theta}$. For specific local sets $\Theta_0(r)$ with
elliptic shape, the deviation probability $\mathbb{P}(\tilde{\theta} \notin \Theta_0(r))$ is exponentially small in $r$. This
concentration bound allows one to restrict the parameter space to a properly selected vicinity
$\Theta_0(r)$. Our main results concern the local properties of the process $L(\theta)$ within $\Theta_0(r)$,
including a bracketing bound and its corollaries.

The paper is organized as follows. Section 2 presents the list of conditions which are 
systematically used in the text. The conditions only concern the properties of the quasi 
log-likelihood process $L(\theta)$.

Section 3 is central to the whole approach and it focuses on local properties
of the process $L(\theta)$ within $\Theta_0(r)$. The idea is to sandwich the underlying (quasi) log-likelihood
process $L(\theta)$ for $\theta \in \Theta_0(r)$ between two quadratic (in parameter) expressions.
Then the maximum of $L(\theta)$ over $\Theta_0(r)$ will be sandwiched as well by the maxima of the
lower and upper processes. The quadratic structure of these processes helps to compute
these maxima explicitly, yielding the bounds for the value of the original problem. This
approximation result is used to derive a number of corollaries including the concentration
and coverage probability, expansion of the estimator $\tilde{\theta}$, polynomial risk bounds, etc. In
contrast to the classical theory, all the results are non-asymptotic and do not involve any
small values of the form $o(1)$; all the terms are specified explicitly. Also the results are
stated under possible model misspecification.

Section 4 complements the local results with the concentration property which bounds
the probability that $\tilde{\theta}$ deviates from the local set $\Theta_0(r)$. In the modern statistical
literature there are a number of studies considering maximum likelihood or more generally
minimum contrast estimators in a general i.i.d. situation, when the parameter set is
a subset of some functional space. We mention the papers Van de Geer (1993), Birgé
and Massart (1993), Birgé and Massart (1998), Birgé (2006) and references therein. The
established results are based on deep probabilistic facts from empirical process theory; 
see e.g. Talagrand (1996, 2001, 2005), van der Vaart and Wellner (1996), Boucheron 
et al. (2003). The general result presented in Section B follows the generic chaining idea 
due to Talagrand (2005); cf. Bednorz (2006). However, we do not assume any specific 
structure of the model. In particular, we do not assume independent observations and 
thus, cannot apply the most developed concentration bounds from the empirical process 
theory. 

Section 5 illustrates the applicability of the general results to the classical case of 
an i.i.d. sample. The previously established general results apply under rather mild 
conditions. Basically we assume some smoothness of the log-likelihood process and some 
minimal number of observations per parameter: the sample size should be at least of
order of the dimensionality $p$ of the parameter space. We also consider the examples of
generalized linear modeling and of median regression.

It is important to mention that the non-asymptotic character of our study yields 
an almost complete change of the mathematical tools: the notions of convergence and 
tightness become meaningless, the arguments based on compactness of the parameter 
space do not apply, etc. Instead we utilize the tools of the empirical process theory based 
on the ideas of concentration of measures and nonasymptotic entropy bounds. Section A 
in the Appendix presents an exponential bound for a general quadratic form which is very 
essential for getting the sharp risk bounds for the quasi MLE. This bound is an important 
step in the concentration results for the quasi MLE. Section B explains how the generic
chaining and majorizing measure device of Talagrand (2005), refined in Bednorz (2006),
can be used for obtaining a general exponential bound for the log-likelihood process.

The proposed approach seems to be very promising and can be useful in many further 
research directions. Here we briefly outline some of them. The results of Sections 3 and 4 
indicate the important role of the dimensionality of the parameter space. Namely, in the 
regular case, the entropy of the parametric family and hence, all the established bounds 
are proportional to the number of parameters to be estimated. This makes the results 
almost uninformative when the dimensionality becomes large relative to the sample size. 
The nonparametric theory operates with parameter sets of infinite dimension under some 
entropy restriction. An important advantage of the developed technique is that it
extends to the case of the penalized likelihood. A proper choice of penalization allows one to
reduce the entropy bounds and to get sharp results even in the case of a high dimensional
parameter space in terms of the so-called effective dimension; see Spokoiny (2012).

This paper focuses on the maximum likelihood estimation. However, the bracketing 
device appears to be very helpful and efficient for analysis of the posterior measure 
in the Bayes approach as well. The subsequent paper Spokoiny (2012) shows how the
prominent Bernstein-von Mises result for quasi posteriors can be obtained using the new
methodology. It also explains an interesting relation between the penalized maximum 
likelihood estimation and the Bayes method with Gaussian priors. 

The proposed methodology has been successfully applied to the problem of bandwidth 
selection in local quantile regression, Spokoiny et al. (2012). An extension of the theory 
to the case of semiparametric estimation is another important issue. The modern semiparametric
theory is quite involved and requires special tools like the hardest parametric
subspace. The new approach can be directly applied to the situation when the target of
analysis is a projection (mapping) of the parameter vector onto some subspace; see Andresen
and Spokoiny (2012). Under some restriction on the dimension or on the entropy of the
nuisance parameter, the bracketing device continues to work. The established results
include the semiparametric efficiency of the MLE and the Wilks theorem.

2 Conditions 

Below we collect the list of conditions which are systematically used in the text. It seems 
to be an advantage of the whole approach that all the results are stated in a unified way 
under the same conditions. Once checked, one obtains automatically all the established 
results. We do not try to formulate the conditions and the results in the most general 
form. In some cases we sacrifice generality in favor of readability and ease of presentation. 
It is important to stress that all the conditions only concern the properties of the quasi
likelihood process $L(\theta)$. Even if the process $L(\cdot)$ is not a sufficient statistic, the whole
analysis is entirely based on its geometric structure and probabilistic properties. The
conditions are not restrictive and can be effectively checked in many particular situations.
Some examples are given in Section 5 for the i.i.d. setup, generalized linear models, and
median regression.

The imposed conditions can be classified into the following groups by their meaning: 

• smoothness conditions on $L(\theta)$ allowing the second order Taylor expansion;

• exponential moment conditions;

• identifiability and regularity conditions.

We also distinguish between local and global conditions. The global conditions concern
the global behavior of the process $L(\theta)$ while the local conditions focus on its behavior
in the vicinity of the central point $\theta^*$. Below we suppose that the degree of locality is
described by a number $r$. The local zone corresponds to $r \le r_0$ for a fixed $r_0$. The
global conditions concern $r > r_0$.

2.1 Local conditions 

Local conditions describe the properties of $L(\theta)$ in a vicinity of the central point $\theta^*$
from (1.2).

To bound local fluctuations of the process $L(\theta)$, we introduce an exponential moment
condition on the stochastic component $\zeta(\theta)$:

$$\zeta(\theta) \stackrel{\mathrm{def}}{=} L(\theta) - \mathbb{E} L(\theta).$$

Below we suppose that the random function $\zeta(\theta)$ is differentiable in $\theta$ and its gradient
$\nabla\zeta(\theta) = \partial\zeta(\theta)/\partial\theta \in \mathbb{R}^p$ has some exponential moments. Our first condition describes
the property of the gradient $\nabla\zeta(\theta^*)$ at the central point $\theta^*$.

$(ED_0)$ There exist a positive symmetric matrix $V_0^2$, and constants $g > 0$, $\nu_0 \ge 1$
such that $\operatorname{Var}\{\nabla\zeta(\theta^*)\} \le V_0^2$ and for all $|\lambda| \le g$

$$\sup_{\gamma \in \mathbb{R}^p} \log \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top \nabla\zeta(\theta^*)}{\|V_0 \gamma\|} \Bigr\} \le \nu_0^2 \lambda^2 / 2.$$

In a typical situation, the matrix $V_0^2$ can be defined as the covariance matrix of the
gradient vector $\nabla\zeta(\theta^*)$: $V_0^2 = \operatorname{Var}(\nabla\zeta(\theta^*)) = \operatorname{Var}(\nabla L(\theta^*))$. If $L(\theta)$ is the log-likelihood
for a correctly specified model, then $\theta^*$ is the true parameter value and $V_0^2$
coincides with the corresponding Fisher information matrix. The matrix $V_0$ shown in
this condition determines the local geometry in the vicinity of $\theta^*$. In particular, define
the local elliptic neighborhoods of $\theta^*$ as

$$\Theta_0(r) = \{\theta \in \Theta : \|V_0(\theta - \theta^*)\| \le r\}. \tag{2.1}$$
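The membership test behind the elliptic neighborhood (2.1) can be sketched in a few lines. The helper names and the particular matrix $V_0$ below are hypothetical and chosen only for illustration; the code simply evaluates $\|V_0(\theta - \theta^*)\|$ and compares it with $r$.

```python
import math

def mat_vec(m, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def in_local_set(theta, theta_star, v0, r):
    """Return True if ||V_0 (theta - theta*)|| <= r, i.e. theta lies in Theta_0(r)."""
    diff = [t - s for t, s in zip(theta, theta_star)]
    w = mat_vec(v0, diff)
    return math.sqrt(sum(x * x for x in w)) <= r

# Hypothetical example: p = 2 and V_0 = diag(10, 20).
v0 = [[10.0, 0.0], [0.0, 20.0]]
theta_star = [0.0, 0.0]

inside = in_local_set([0.1, 0.05], theta_star, v0, r=2.0)   # norm is sqrt(2)
outside = in_local_set([0.1, 0.1], theta_star, v0, r=2.0)   # norm is sqrt(5)
print(inside, outside)
```

Because the norm is weighted by $V_0$, directions in which the score has larger variance correspond to narrower axes of the ellipsoid.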

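The further conditions are restricted to such defined neighborhoods $\Theta_0(r)$.

$(ED_1)$ For each $r \le r_0$, there exists a constant $\omega(r) \le 1/2$ such that it holds for all
$\theta \in \Theta_0(r)$

$$\sup_{\gamma \in \mathbb{R}^p} \log \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top \{\nabla\zeta(\theta) - \nabla\zeta(\theta^*)\}}{\omega(r) \|V_0 \gamma\|} \Bigr\} \le \nu_0^2 \lambda^2 / 2, \qquad |\lambda| \le g.$$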

The main bracketing result also requires second order smoothness of the expected log-likelihood
$\mathbb{E} L(\theta)$. By definition, $L(\theta^*, \theta^*) = 0$ and $\nabla \mathbb{E} L(\theta^*) = 0$ because $\theta^*$ is the
extreme point of $\mathbb{E} L(\theta)$. Therefore, $-\mathbb{E} L(\theta, \theta^*)$ can be approximated by a quadratic
function of $\theta - \theta^*$ in the neighborhood of $\theta^*$. The local identifiability condition quantifies
this quadratic approximation from above and from below on the set $\Theta_0(r)$ from (2.1).

$(\mathcal{L}_0)$ There are a symmetric strictly positive-definite matrix $D_0^2$ and, for each $r \le r_0$,
a constant $\delta(r) \le 1/2$ such that it holds on the set $\Theta_0(r) = \{\theta : \|V_0(\theta - \theta^*)\| \le r\}$

$$\Bigl| \frac{-2 \mathbb{E} L(\theta, \theta^*)}{\|D_0(\theta - \theta^*)\|^2} - 1 \Bigr| \le \delta(r).$$



Usually $D_0^2$ is defined as the negative Hessian of $\mathbb{E} L(\theta)$ at $\theta^*$: $D_0^2 = -\nabla^2 \mathbb{E} L(\theta^*)$. If
$L(\theta, \theta^*)$ is the log-likelihood ratio and $\mathbb{P} = \mathbb{P}_{\theta^*}$ then
$-\mathbb{E} L(\theta, \theta^*) = \mathbb{E}_{\theta^*} \log(d\mathbb{P}_{\theta^*}/d\mathbb{P}_\theta) = \mathcal{K}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta)$,
the Kullback-Leibler divergence between $\mathbb{P}_{\theta^*}$ and $\mathbb{P}_\theta$. Then condition
$(\mathcal{L}_0)$ with $D_0 = V_0$ follows from the usual regularity conditions on the family $(\mathbb{P}_\theta)$;
cf. Ibragimov and Khas'minskij (1981). If the log-likelihood process $L(\theta)$ is sufficiently
smooth in $\theta$, e.g. three times stochastically differentiable, then the quantities $\omega(r)$ and
$\delta(r)$ can be taken proportional to the value $\varrho(r)$ defined as

$$\varrho(r) \stackrel{\mathrm{def}}{=} \max_{\theta \in \Theta_0(r)} \|\theta - \theta^*\|.$$

In the important special case of an i.i.d. model one can take $\omega(r) = \omega^* r / n^{1/2}$ and
$\delta(r) = \delta^* r / n^{1/2}$ for some constants $\omega^*, \delta^*$; see Section 5.1.

The identifiability condition relates the matrices $D_0^2$ and $V_0^2$.

$(\mathcal{I})$ There is a constant $a > 0$ such that $a^2 D_0^2 \ge V_0^2$.
2.2 Global conditions 

The global conditions have to be fulfilled for all $\theta$ lying beyond $\Theta_0(r_0)$. We only impose
one condition on the smoothness of the stochastic component of the process $L(\theta)$ in terms
of its gradient, and one identifiability condition in terms of the expectation $\mathbb{E} L(\theta, \theta^*)$.

The first condition is similar to the local condition $(ED_0)$ and it requires some exponential
moment of the gradient $\nabla\zeta(\theta)$ for all $\theta \in \Theta$. However, the constant $g$ may
depend on the radius $r = \|V_0(\theta - \theta^*)\|$.

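$(E r)$ For any $r$, there exists a value $g(r) > 0$ such that for all $\lambda \le g(r)$

$$\sup_{\theta \in \Theta_0(r)} \sup_{\gamma \in \mathbb{R}^p} \log \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top \nabla\zeta(\theta)}{\|V_0 \gamma\|} \Bigr\} \le \nu_0^2 \lambda^2 / 2.$$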

The global identification property means that the deterministic component $\mathbb{E} L(\theta, \theta^*)$
of the log-likelihood is competitive with its variance $\operatorname{Var} L(\theta, \theta^*)$.

$(\mathcal{L} r)$ There is a function $b(r)$ such that $r\, b(r)$ monotonically increases in $r$ and for
each $r \ge r_0$

$$\inf_{\theta :\, \|V_0(\theta - \theta^*)\| = r} |\mathbb{E} L(\theta, \theta^*)| \ge b(r)\, r^2.$$

3 Local inference 

The Local Asymptotic Normality (LAN) condition, since its introduction in Le Cam (1960),
has become one of the central notions in statistical theory. It postulates a kind of local
approximation of the log-likelihood of the original model by the log-likelihood of a
Gaussian shift experiment. The LAN property, once checked, yields a number of important
corollaries for statistical inference. In words, if you can solve a statistical problem
for the Gaussian shift model, the result can be translated under the LAN condition to the
original setup. We refer to Ibragimov and Khas'minskij (1981) for a nice presentation of
the LAN theory including asymptotic efficiency of MLE and Bayes estimators. The LAN
property was extended to mixed LAN or Local Asymptotic Quadraticity (LAQ); see e.g.
Le Cam and Yang (2000). All these notions are very much asymptotic and very much
local. The LAN theory also requires that $L(\theta)$ is the correctly specified log-likelihood.
The strict localization does not allow for considering a growing or infinite parameter
dimension and limits applications of the LAN theory to nonparametric estimation.

Our approach tries to avoid asymptotic constructions and attempts to include a possible
model misspecification and a large dimension of the parameter space. The presentation
below shows that such an extension of the LAN theory can be made at essentially
no price: all the major asymptotic results like the Fisher and Cramér-Rao information
bounds, as well as the Wilks phenomenon, can be derived as corollaries of the obtained
non-asymptotic statements simply by letting the sample size tend to infinity. At the same
time, it applies to a high dimensional parameter space.

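The LAN property states that the considered process $L(\theta)$ can be approximated by
an expression quadratic in $\theta$ in a vicinity of the central point $\theta^*$. This property is usually
checked using the second order Taylor expansion. The main problem arising here is that
the error of the approximation grows too fast with the local size of the neighborhood.
Section 3.1 presents the non-asymptotic version of the LAN property in which the local
quadratic approximation of $L(\theta)$ is replaced by bounding this process from above and
from below by two different processes quadratic in $\theta$. More precisely, we apply the
bracketing idea: the difference $L(\theta, \theta^*) = L(\theta) - L(\theta^*)$ is put between two quadratic
processes $\mathbb{L}_{\underline{\epsilon}}(\theta, \theta^*)$ and $\mathbb{L}_{\epsilon}(\theta, \theta^*)$:

$$\mathbb{L}_{\underline{\epsilon}}(\theta, \theta^*) - \diamondsuit_{\underline{\epsilon}} \le L(\theta, \theta^*) \le \mathbb{L}_{\epsilon}(\theta, \theta^*) + \diamondsuit_{\epsilon}, \qquad \theta \in \Theta_0(r), \tag{3.1}$$

where $\epsilon$ is a numerical parameter, $\underline{\epsilon} = -\epsilon$, and $\diamondsuit_{\epsilon}$ and $\diamondsuit_{\underline{\epsilon}}$ are stochastic errors which
only depend on the selected vicinity $\Theta_0(r)$. The upper process $\mathbb{L}_{\epsilon}(\theta, \theta^*)$ and the lower
process $\mathbb{L}_{\underline{\epsilon}}(\theta, \theta^*)$ can deviate substantially from each other; however, the errors $\diamondsuit_{\epsilon}$, $\diamondsuit_{\underline{\epsilon}}$
remain small even if the value $r$ describing the size of the local neighborhood $\Theta_0(r)$ is
large.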

The sandwiching result (3.1) naturally leads to two important notions: the value of
the problem and the spread. It turns out that most of the statements like confidence
and concentration probability rely upon the maximum of $L(\theta, \theta^*)$ over $\theta$ which we
call the excess. Its expectation will be referred to as the value of the problem. Due to
(3.1) the excess can be bounded from above and from below using the similar quantities
$\max_\theta \mathbb{L}_{\underline{\epsilon}}(\theta, \theta^*)$ and $\max_\theta \mathbb{L}_{\epsilon}(\theta, \theta^*)$ which can be called the lower and upper excess,
while their expectations are the values of the lower and upper problems. Note that
$\max_\theta \{\mathbb{L}_{\epsilon}(\theta, \theta^*) - \mathbb{L}_{\underline{\epsilon}}(\theta, \theta^*)\} = \infty$. However, this is not crucial. What really matters is
the difference between the upper and the lower excess. The spread $\Delta_\epsilon$ can be defined
as the width of the interval bounding the excess due to (3.1), that is, as the sum of the
approximation errors and of the difference between the upper and the lower excess:

$$\Delta_\epsilon \stackrel{\mathrm{def}}{=} \diamondsuit_{\epsilon} + \diamondsuit_{\underline{\epsilon}} + \bigl\{ \max_\theta \mathbb{L}_{\epsilon}(\theta, \theta^*) - \max_\theta \mathbb{L}_{\underline{\epsilon}}(\theta, \theta^*) \bigr\}.$$


The range of applicability of this approach can be described by the following mnemonic
rule: "The value of the upper problem is larger in order than the spread." The further
sections explain in detail the meaning and content of this rule. Section 3.1 presents the
key bound (3.1) and derives it from the general results on empirical processes. Section 3.2 
presents some straightforward corollaries of the bound (3.1) including the coverage and 
concentration probabilities, expansion of the MLE and the risk bounds. It also indicates 
how the classical results on asymptotic efficiency of the MLE follow from the obtained 
non-asymptotic bounds. 

3.1 Local quadratic bracketing 

This section presents the key result about local quadratic approximation of the quasi 
log-likelihood process given by Theorem 3.1 below. 

Let the radius $r$ of the local neighborhood $\Theta_0(r)$ be fixed in a way that the deviation
probability $\mathbb{P}(\tilde{\theta} \notin \Theta_0(r))$ is sufficiently small. Precise results about the choice of $r$
which ensures this property are postponed until Section 4. In this neighborhood $\Theta_0(r)$
we aim at building some quadratic lower and upper bounds for the process $L(\theta)$. The
first step is the usual decomposition of this process into deterministic and stochastic
components:

$$L(\theta) = \mathbb{E} L(\theta) + \zeta(\theta),$$

where $\zeta(\theta) = L(\theta) - \mathbb{E} L(\theta)$. Condition $(\mathcal{L}_0)$ allows one to approximate the smooth deterministic
function $\mathbb{E} L(\theta) - \mathbb{E} L(\theta^*)$ around the point of maximum $\theta^*$ by the quadratic
form $-\|D_0(\theta - \theta^*)\|^2/2$. The smoothness properties of the stochastic component $\zeta(\theta)$
given by conditions $(ED_0)$ and $(ED_1)$ lead to the linear approximation
$\zeta(\theta) - \zeta(\theta^*) \approx (\theta - \theta^*)^\top \nabla\zeta(\theta^*)$. Putting these two approximations together yields the following
approximation of the process $L(\theta)$ on $\Theta_0(r)$:

$$L(\theta, \theta^*) \approx \mathbb{L}(\theta, \theta^*) \stackrel{\mathrm{def}}{=} (\theta - \theta^*)^\top \nabla\zeta(\theta^*) - \|D_0(\theta - \theta^*)\|^2/2. \tag{3.2}$$

This expansion is used in most of statistical calculus. However, it does not suit our
purposes because the error of approximation grows quadratically with the radius $r$ and
starts to dominate at some critical value of $r$. We slightly modify the construction by
introducing two different approximating processes. They only differ in the deterministic
quadratic term which is either shrunk or stretched relative to the term $\|D_0(\theta - \theta^*)\|^2/2$
in $\mathbb{L}(\theta, \theta^*)$.

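Let $\delta, \varrho$ be nonnegative constants. Introduce for a vector $\epsilon = (\delta, \varrho)$ the following
notation:

$$\mathbb{L}_{\epsilon}(\theta, \theta^*) \stackrel{\mathrm{def}}{=} (\theta - \theta^*)^\top \nabla L(\theta^*) - \|D_\epsilon(\theta - \theta^*)\|^2/2
= \xi_\epsilon^\top D_\epsilon(\theta - \theta^*) - \|D_\epsilon(\theta - \theta^*)\|^2/2, \tag{3.3}$$

where

$$D_\epsilon^2 = D_0^2(1 - \delta) - \varrho V_0^2, \qquad \xi_\epsilon \stackrel{\mathrm{def}}{=} D_\epsilon^{-1} \nabla L(\theta^*).$$

Here we implicitly assume that with the proposed choice of the constants $\delta$ and $\varrho$, the
matrix $D_\epsilon^2$ is non-negative: $D_\epsilon^2 \ge 0$. The representation (3.3) indicates that the process
$\mathbb{L}_{\epsilon}(\theta, \theta^*)$ has the geometric structure of the log-likelihood of a linear Gaussian model. We do
not require that the vector $\xi_\epsilon$ is Gaussian and hence, it is not the Gaussian log-likelihood.
However, the geometric structure of this process appears to be more important than its
distributional properties.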

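One can see that if $\delta, \varrho$ are positive, the quadratic drift component of the process
$\mathbb{L}_{\epsilon}(\theta, \theta^*)$ is shrunk relative to $\mathbb{L}(\theta, \theta^*)$ in (3.2), and it is stretched if
$\delta, \varrho$ are negative. Now, given $r$, define $\delta = \delta(r)$, $\varrho = 3\nu_0\,\omega(r)$ with the value $\delta(r)$
from condition $(\mathcal{L}_0)$ and $\omega(r)$ from condition $(ED_1)$. Finally set $\underline{\epsilon} = -\epsilon$, so that
$D_{\underline{\epsilon}}^2 = D_0^2(1 + \delta) + \varrho V_0^2$.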
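The effect of shrinking and stretching the quadratic term can be checked numerically in the scalar case $p = 1$. The sketch below is only an illustration: the function names and the numerical values of $D_0^2$, $V_0^2$, $\delta$, $\varrho$ and of the score $\nabla L(\theta^*)$ are hypothetical. It uses the elementary fact that $u \mapsto u\,s - d^2 u^2/2$ is maximized at $u = s/d^2$ with maximal value $s^2/(2d^2)$, which for (3.3) equals $\|\xi_\epsilon\|^2/2$.

```python
def bracket_matrices(d0_sq, v0_sq, delta, rho):
    """Return (D_eps^2, D_{-eps}^2) in the scalar case p = 1."""
    upper = d0_sq * (1 - delta) - rho * v0_sq  # shrunk drift: upper process
    lower = d0_sq * (1 + delta) + rho * v0_sq  # stretched drift: lower process
    if upper <= 0:
        raise ValueError("delta and rho are too large: D_eps^2 must stay positive")
    return upper, lower

def excess(score, d_sq):
    """Maximum over u of u*score - d_sq*u^2/2, attained at u = score/d_sq."""
    return score ** 2 / (2 * d_sq)

# Hypothetical values, e.g. n = 100 observations with unit Fisher information.
d0_sq, v0_sq = 100.0, 100.0
delta, rho = 0.05, 0.03
score = 12.0  # hypothetical value of the score at theta*

up_sq, lo_sq = bracket_matrices(d0_sq, v0_sq, delta, rho)
upper_excess = excess(score, up_sq)   # ||xi_eps||^2 / 2
lower_excess = excess(score, lo_sq)   # ||xi_{-eps}||^2 / 2
gap = upper_excess - lower_excess
print(up_sq, lo_sq, upper_excess, lower_excess, gap)
```

The gap between the two maxima is the deterministic part of the spread; it shrinks as $\delta$ and $\varrho$ tend to zero.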

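Theorem 3.1. Assume $(ED_1)$ and $(\mathcal{L}_0)$. Let for some $r$, the values $\varrho \ge 3\nu_0\,\omega(r)$ and
$\delta \ge \delta(r)$ be such that $D_0^2(1 - \delta) - \varrho V_0^2 \ge 0$. Then

$$\mathbb{L}_{\underline{\epsilon}}(\theta, \theta^*) - \diamondsuit_{\underline{\epsilon}}(r) \le L(\theta, \theta^*) \le \mathbb{L}_{\epsilon}(\theta, \theta^*) + \diamondsuit_{\epsilon}(r), \qquad \theta \in \Theta_0(r), \tag{3.4}$$

with $\mathbb{L}_{\epsilon}(\theta, \theta^*)$, $\mathbb{L}_{\underline{\epsilon}}(\theta, \theta^*)$ defined by (3.3). The error terms $\diamondsuit_{\epsilon}(r)$ and $\diamondsuit_{\underline{\epsilon}}(r)$ satisfy the
bound (3.11) from Proposition 3.7.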

Remark 3.1. This bracketing bound (3.4) describes some properties of the log-likelihood
process and the estimator $\tilde{\theta}$ is not shown there. However, it directly implies most of
our inference results. We therefore formulate it as a separate statement. Section 3.3
below presents some exponential bounds on the error terms $\diamondsuit_{\epsilon}(r)$ and $\diamondsuit_{\underline{\epsilon}}(r)$. The
main message is that under rather broad conditions, these errors are small and have only
minor impact on the inference for the quasi MLE $\tilde{\theta}$.

3.2 Local inference

This section presents a list of corollaries from the basic approximation bounds of Theorem 3.1.
The idea is to replace the original problem by a similar one for the approximating
upper and lower models. It is important to stress once again that all the corollaries only
rely on the bracketing result (3.1) and the geometric structure of the processes $\mathbb{L}_{\epsilon}$ and
$\mathbb{L}_{\underline{\epsilon}}$. Define the spread $\Delta_\epsilon(r)$ by

$$\Delta_\epsilon(r) \stackrel{\mathrm{def}}{=} \diamondsuit_{\epsilon}(r) + \diamondsuit_{\underline{\epsilon}}(r) + \bigl( \|\xi_\epsilon\|^2 - \|\xi_{\underline{\epsilon}}\|^2 \bigr)/2. \tag{3.5}$$

Here $\xi_\epsilon = D_\epsilon^{-1} \nabla L(\theta^*)$ and $\xi_{\underline{\epsilon}} = D_{\underline{\epsilon}}^{-1} \nabla L(\theta^*)$. The quantity $\Delta_\epsilon(r)$ appears to be the
price induced by our bracketing device. Section 3.3 below presents some probabilistic
bounds on the spread showing that it is small relative to the other terms. All our
corollaries below are stated under conditions of Theorem 3.1 and implicitly assume that
the spread can be nearly ignored.

3.2.1 Local coverage probability 

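Our first result describes the probability of covering $\theta^*$ by the random set

$$\mathcal{E}(\mathfrak{z}) = \{\theta : 2L(\tilde{\theta}, \theta) \le \mathfrak{z}\}. \tag{3.6}$$

Corollary 3.2. For any $\mathfrak{z} > 0$

$$\mathbb{P}\{\mathcal{E}(\mathfrak{z}) \not\ni \theta^*,\ \tilde{\theta} \in \Theta_0(r)\} \le \mathbb{P}\{\|\xi_\epsilon\|^2 \ge \mathfrak{z} - \diamondsuit_{\epsilon}(r)\}. \tag{3.7}$$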

Proof. The bound (3.7) follows from the upper bound of Theorem 3.1 and the statement 
(3.12) of Lemma 3.8 below. □ 

Below, see (3.14), we also present an exponential bound which helps to answer a very
important question about a proper choice of the critical value $\mathfrak{z}$ ensuring a prescribed
covering probability.

3.2.2 Local expansion, Wilks theorem, and local concentration 

Now we show how the bound (3.4) can be used for obtaining a local expansion of the
quasi MLE $\tilde{\theta}$. All our results will be conditioned on the random set $C_\epsilon(r)$ defined as

$$C_\epsilon(r) \stackrel{\mathrm{def}}{=} \{\tilde{\theta} \in \Theta_0(r),\ \|V_0 D_\epsilon^{-1} \xi_\epsilon\| \le r\}. \tag{3.8}$$

Below in Section 3.3 we present some upper bounds on the value $r$ ensuring a dominating
probability of this random set.

The first result can be viewed as a finite sample version of the famous Wilks theorem.

Corollary 3.3. On the random set $C_\epsilon(r)$ from (3.8), it holds

$$\|\xi_{\underline{\epsilon}}\|^2/2 - \diamondsuit_{\underline{\epsilon}}(r) \le L(\tilde{\theta}, \theta^*) \le \|\xi_\epsilon\|^2/2 + \diamondsuit_{\epsilon}(r). \tag{3.9}$$

The next result is an extension of another prominent asymptotic result, namely, the 
Fisher expansion of the MLE. 

Corollary 3.4. On the random set $C_\epsilon(r)$ from (3.8), it holds

$$\|D_\epsilon(\tilde{\theta} - \theta^*) - \xi_\epsilon\|^2 \le 2\Delta_\epsilon(r). \tag{3.10}$$

The proof of Corollaries 3.3 and 3.4 relies on the solution of the upper and lower 
problems and it is given below at the end of this section. 

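Now we describe concentration properties of $\tilde{\theta}$ assuming that $\tilde{\theta}$ is restricted to
$\Theta_0(r)$. More precisely, we bound the probability that $\|D_\epsilon(\tilde{\theta} - \theta^*)\| > z$ for a given
$z > 0$.

Corollary 3.5. For any $z > 0$, it holds

$$\mathbb{P}\{\|D_\epsilon(\tilde{\theta} - \theta^*)\| > z,\ C_\epsilon(r)\} \le \mathbb{P}\{\|\xi_\epsilon\| > z - \sqrt{2\Delta_\epsilon(r)}\}.$$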

An interesting and important question is for which $\mathfrak{z}$ in (3.6) the coverage probability
of the event $\{\mathcal{E}(\mathfrak{z}) \ni \theta^*\}$, or for which $z$ the concentration probability of the event
$\{\|D_\epsilon(\tilde{\theta} - \theta^*)\| \le z\}$, becomes close to one. It will be addressed in Section 3.3.

3.2.3 A local risk bound 

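Below we also bound the moments of the excess $L(\tilde{\theta}, \theta^*)$ and of the normalized loss
$D_\epsilon(\tilde{\theta} - \theta^*)$ when $\tilde{\theta}$ is restricted to $\Theta_0(r)$. The result follows directly from Corollaries 3.3
and 3.4.

Corollary 3.6. For $u > 0$

$$\mathbb{E}\bigl\{ L^u(\tilde{\theta}, \theta^*)\, 1(\tilde{\theta} \in \Theta_0(r)) \bigr\} \le \mathbb{E}\bigl[ \{\|\xi_\epsilon\|^2/2 + \diamondsuit_{\epsilon}(r)\}^u \bigr].$$

Moreover, it holds

$$\mathbb{E}\bigl\{ \|D_\epsilon(\tilde{\theta} - \theta^*)\|^u\, 1(C_\epsilon(r)) \bigr\} \le \mathbb{E}\bigl[ \{\|\xi_\epsilon\| + \sqrt{2\Delta_\epsilon(r)}\}^u \bigr].$$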

3.2.4 Comparing with the asymptotic theory

This section briefly discusses the relation between the established non-asymptotic bounds
and the classical asymptotic results in parametric estimation. This comparison is not
straightforward because the asymptotic theory involves the sample size or noise level as
the asymptotic parameter, while our setup is very general and works even for a "single"
observation. Here we simply treat $\epsilon = (\delta, \varrho)$ as a small parameter. This is well justified
by the i.i.d. case with $n$ observations, where it holds $\delta = \delta(r) \asymp r/\sqrt{n}$ and similarly
for $\varrho$; see Section 5 for more details. The bounds below in Section 3.3 show that the
spread $\Delta_\epsilon(r)$ from (3.5) is small and can be ignored in the asymptotic calculations. The
results of Corollaries 3.2 through 3.6 represent the desired bounds in terms of deviation
bounds for the quadratic form $\|\xi_\epsilon\|^2$.

To better understand the essence of the presented results, consider first the "true"
parametric model with the correctly specified log-likelihood $L(\theta)$. Then $D_0^2 = V_0^2$ is
the total Fisher information matrix. In the i.i.d. case it becomes $n f_0$ where $f_0$ is the
usual Fisher information matrix of the considered parametric family at $\theta^*$. In particular,
$\mathrm{Var}\{\nabla L(\theta^*)\} = n f_0$. So, if $D_\epsilon$ is close to $D_0$, then $\xi_\epsilon$ can be treated as the normalized
score. Under usual assumptions, $\xi \stackrel{\rm def}{=} D_0^{-1} \nabla L(\theta^*)$ is an asymptotically standard normal
$p$-vector. The same applies to $\xi_\epsilon$. Now one can observe that Corollaries 3.2 through 3.6
directly imply most of the classical asymptotic statements. In particular, Corollary 3.3 shows
that the twice excess $2L(\tilde\theta,\theta^*)$ is nearly $\|\xi\|^2$ and thus nearly $\chi^2_p$ (Wilks Theorem).

Corollary 3.4 yields the expansion $D_\epsilon(\tilde\theta - \theta^*) \approx \xi$ (the Fisher expansion) and hence,
$D_\epsilon(\tilde\theta - \theta^*)$ is asymptotically standard normal. The asymptotic variance of $D_\epsilon(\tilde\theta - \theta^*)$ is
nearly one, so $\tilde\theta$ achieves the Cramer-Rao efficiency bound in the asymptotic set-up.
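
The Wilks-type statement can be checked by hand in the simplest correctly specified case. The sketch below assumes, purely for illustration, i.i.d. observations $Y_i \sim N(\theta^*, 1)$ with a scalar parameter, so that $D_0^2 = n$ and $\xi = n^{-1/2} \sum_i (Y_i - \theta^*)$; in this model the identity $2L(\tilde\theta,\theta^*) = \|\xi\|^2$ holds exactly, not only approximately. The function names are hypothetical and not part of the paper.

```python
import math
import random


def gaussian_loglik(theta, ys):
    """Log-likelihood of i.i.d. N(theta, 1) data, up to an additive constant."""
    return -0.5 * sum((y - theta) ** 2 for y in ys)


def twice_excess_and_score(ys, theta_star):
    """Return (2 L(theta_tilde, theta*), ||xi||^2) in the Gaussian shift model."""
    n = len(ys)
    theta_tilde = sum(ys) / n  # the MLE is the sample mean
    excess = gaussian_loglik(theta_tilde, ys) - gaussian_loglik(theta_star, ys)
    # xi = D_0^{-1} grad L(theta*), with D_0^2 = n in this model
    xi = sum(y - theta_star for y in ys) / math.sqrt(n)
    return 2.0 * excess, xi ** 2


random.seed(0)
sample = [random.gauss(1.0, 1.0) for _ in range(200)]
twice_excess, xi_sq = twice_excess_and_score(sample, theta_star=1.0)
```

In this Gaussian case the spread is zero; for other models the two quantities differ by at most the spread-type error terms discussed in Section 3.3.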

3.3 Spread 

This section presents some bounds on the value

$$ \Delta_\epsilon(r) \stackrel{\rm def}{=} \diamondsuit_\epsilon(r) + \diamondsuit_{\underline\epsilon}(r) + (\|\xi_\epsilon\|^2 - \|\xi_{\underline\epsilon}\|^2)/2. $$

This quantity is random but it can be easily evaluated under the conditions made. We
present two different results: one bounds the errors $\diamondsuit_\epsilon(r), \diamondsuit_{\underline\epsilon}(r)$, while the other presents
a deviation bound on quadratic forms like $\|\xi_\epsilon\|^2$. The results are stated under conditions
$(ED_0)$ and $(ED_1)$ in a non-asymptotic way, so the formulation is quite technical. An
informal discussion at the end of this section explains the typical behavior of the spread.
The first result accomplishes the bracketing bound (3.4).

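Proposition 3.7. Assume $(ED_1)$. The error term $\diamondsuit_\epsilon(r)$ in (3.4) fulfills

$$ \mathbb{P}\bigl\{ \varrho^{-1} \diamondsuit_\epsilon(r) \ge \mathfrak{z}_0(x, Q) \bigr\} \le \exp(-x) \qquad (3.11) $$

with $\mathfrak{z}_0(x,Q)$ given for $g_0 = g \nu_0 \ge 3$ by

$$ \mathfrak{z}_0(x,Q) \stackrel{\rm def}{=} \begin{cases} (1 + \sqrt{x + Q})^2 & \text{if } 1 + \sqrt{x + Q} \le g_0, \\ 1 + \{ 2 g_0^{-1} (x + Q) + g_0 \}^2 & \text{otherwise,} \end{cases} $$

where $Q = c_1 p$ with $c_1 = 2$ for $p \ge 2$ and $c_1 = 2.7$ for $p = 1$. Similarly for $\diamondsuit_{\underline\epsilon}(r)$.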

Remark 3.2. The bound (3.11) essentially depends on the value $g$ from condition
$(ED_1)$. The result requires that $g \nu_0 \ge 3$. However, this constant can usually be taken
of order $n^{1/2}$; see Section 5 for examples. If $g^2$ is larger in order than $p + x$, then
$\mathfrak{z}_0(x,Q) \approx c_1 p + x$.
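
For orientation, the two regimes of the bound can be evaluated numerically. The sketch below assumes that the threshold of Proposition 3.7 has the form $\mathfrak{z}_0(x,Q) = (1+\sqrt{x+Q})^2$ when $1+\sqrt{x+Q} \le g_0$ and $1 + \{2 g_0^{-1}(x+Q) + g_0\}^2$ otherwise, with $Q = c_1 p$; the helper name `z0` is hypothetical.

```python
import math


def z0(x, p, g0):
    """Threshold z_0(x, Q) with Q = c1 * p, under the two-regime form assumed above."""
    if g0 < 3:
        raise ValueError("the bound is stated for g0 = g * nu0 >= 3")
    c1 = 2.7 if p == 1 else 2.0  # constants quoted in Proposition 3.7
    q = c1 * p
    root = 1.0 + math.sqrt(x + q)
    if root <= g0:
        # moderate regime: behaves like x + Q for large x + Q
        return root ** 2
    # large-deviation regime, where g0 is too small relative to x + Q
    return 1.0 + (2.0 * (x + q) / g0 + g0) ** 2


# With g0 large relative to sqrt(p + x), the value is close to c1 * p + x.
ratio = z0(10.0, 10000, 1000.0) / (2.0 * 10000 + 10.0)
```

The ratio in the last line illustrates the statement of Remark 3.2: once $g^2$ dominates $p + x$, the threshold is essentially $c_1 p + x$.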

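Proof. Consider for fixed $r$ and $\epsilon = (\delta, \varrho)$ the quantity

$$ \diamondsuit_\epsilon(r) \stackrel{\rm def}{=} \sup_{\theta \in \Theta_0(r)} \Bigl\{ L(\theta,\theta^*) - \mathbb{E} L(\theta,\theta^*) - (\theta - \theta^*)^\top \nabla L(\theta^*) - \frac{\varrho}{2} \|V_0(\theta - \theta^*)\|^2 \Bigr\}. $$

As $\delta \ge \delta(r)$, it holds $-\mathbb{E} L(\theta,\theta^*) \ge (1 - \delta) \|D_0(\theta - \theta^*)\|^2/2$ and $L(\theta,\theta^*) - \mathbb{L}_\epsilon(\theta,\theta^*) \le \diamondsuit_\epsilon(r)$.
Moreover, in view of $\nabla \mathbb{E} L(\theta^*) = 0$, the definition of $\diamondsuit_\epsilon(r)$ can be rewritten in the
form

$$ \diamondsuit_\epsilon(r) = \sup_{\theta \in \Theta_0(r)} \Bigl\{ \zeta(\theta,\theta^*) - (\theta - \theta^*)^\top \nabla \zeta(\theta^*) - \frac{\varrho}{2} \|V_0(\theta - \theta^*)\|^2 \Bigr\}. $$

Now the claim of the proposition can be easily reduced to an exponential bound for the
quantity $\diamondsuit_\epsilon(r)$. We apply Theorem B.12 to the process

$$ \mathcal{U}(\theta,\theta^*) = \frac{1}{\omega(r)} \bigl\{ \zeta(\theta,\theta^*) - (\theta - \theta^*)^\top \nabla \zeta(\theta^*) \bigr\}, \qquad \theta \in \Theta_0(r), $$

and $H_0 = V_0$. Condition $(ED)$ follows from $(ED_1)$ with the same $\nu_0$ and $g$ in view of
$\nabla \mathcal{U}(\theta,\theta^*) = \{\nabla \zeta(\theta) - \nabla \zeta(\theta^*)\}/\omega(r)$. So, the conditions of Theorem B.12 are fulfilled
yielding (3.11) in view of $\varrho \ge 3 \nu_0 \omega(r)$. □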

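Due to the main bracketing result, the local excess $\sup_{\theta \in \Theta_0(r)} L(\theta,\theta^*)$ can be put between
similar quantities for the upper and lower approximating processes up to the error term
$\diamondsuit_\epsilon(r)$. The random quantity $\sup_\theta \mathbb{L}_\epsilon(\theta,\theta^*)$ can be called the upper excess while
$\sup_\theta \mathbb{L}_{\underline\epsilon}(\theta,\theta^*)$ is the lower excess. The quadratic (in $\theta$) structure of the functions
$\mathbb{L}_\epsilon(\theta,\theta^*)$ and $\mathbb{L}_{\underline\epsilon}(\theta,\theta^*)$ enables us to explicitly solve the problem of maximizing the
corresponding function w.r.t. $\theta$. The next result describes the upper and lower excesses.

Lemma 3.8. It holds

$$ \sup_{\theta \in \mathbb{R}^p} \mathbb{L}_\epsilon(\theta,\theta^*) = \|\xi_\epsilon\|^2/2. \qquad (3.12) $$

On the random set $\{\|V_0 D_{\underline\epsilon}^{-1} \xi_{\underline\epsilon}\| \le r\}$, it also holds

$$ \sup_{\theta \in \Theta_0(r)} \mathbb{L}_{\underline\epsilon}(\theta,\theta^*) = \|\xi_{\underline\epsilon}\|^2/2. $$

Proof. The unconstrained maximum of the quadratic form $\mathbb{L}_\epsilon(\theta,\theta^*)$ w.r.t. $\theta$ is attained
at the point $\theta$ with $\theta - \theta^* = D_\epsilon^{-1} \xi_\epsilon = D_\epsilon^{-2} \nabla L(\theta^*)$, yielding the expression (3.12). The lower
excess is computed similarly. □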

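Our next step is in bounding the difference $\|\xi_\epsilon\|^2 - \|\xi_{\underline\epsilon}\|^2$. It can be decomposed as

$$ \|\xi_\epsilon\|^2 - \|\xi_{\underline\epsilon}\|^2 = \|\xi_\epsilon\|^2 - \|\xi\|^2 + \|\xi\|^2 - \|\xi_{\underline\epsilon}\|^2 $$

with $\xi = D_0^{-1} \nabla L(\theta^*)$. If the values $\delta, \varrho$ are small then the difference $\|\xi_\epsilon\|^2 - \|\xi_{\underline\epsilon}\|^2$ is
automatically smaller than $\|\xi\|^2$.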



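Lemma 3.9. Suppose $(\mathcal{I})$ and let $\tau_\epsilon \stackrel{\rm def}{=} \delta + \varrho a^2 < 1$. Then

$$ D_\epsilon^2 \ge (1 - \tau_\epsilon) D_0^2, \qquad D_{\underline\epsilon}^2 \le (1 + \tau_\epsilon) D_0^2, \qquad \|I_p - D_\epsilon D_{\underline\epsilon}^{-2} D_\epsilon\|_\infty \le \alpha_\epsilon \stackrel{\rm def}{=} \frac{2\tau_\epsilon}{1 - \tau_\epsilon^2}. \qquad (3.13) $$

Moreover,

$$ \|\xi_\epsilon\|^2 - \|\xi\|^2 \le \frac{\tau_\epsilon}{1 - \tau_\epsilon} \|\xi\|^2, \qquad \|\xi\|^2 - \|\xi_{\underline\epsilon}\|^2 \le \frac{\tau_\epsilon}{1 + \tau_\epsilon} \|\xi\|^2, $$

$$ \|\xi_\epsilon\|^2 - \|\xi_{\underline\epsilon}\|^2 \le \alpha_\epsilon \|\xi\|^2. $$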
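
A scalar sanity check can make the role of $\tau_\epsilon$ concrete. The sketch below assumes the definitions $D_\epsilon^2 = (1-\delta) D_0^2 - \varrho V_0^2$, $D_{\underline\epsilon}^2 = (1+\delta) D_0^2 + \varrho V_0^2$, $\xi_\epsilon = D_\epsilon^{-1} \nabla L(\theta^*)$, and the constant $\alpha_\epsilon = 2\tau_\epsilon/(1-\tau_\epsilon^2)$; all of these are taken as assumptions of the illustration, and the function name is hypothetical.

```python
def lemma_3_9_bounds(d0_sq, v0_sq, a_sq, delta, rho, score):
    """Check the bounds of Lemma 3.9 in dimension p = 1 (assumed definitions).

    d0_sq, v0_sq : the scalars D_0^2 and V_0^2, with v0_sq <= a_sq * d0_sq
    score        : the value of grad L(theta*)
    Returns a list of booleans, one per inequality.
    """
    tau = delta + rho * a_sq
    if not (v0_sq <= a_sq * d0_sq and 0.0 <= tau < 1.0):
        raise ValueError("need V_0^2 <= a^2 D_0^2 and tau in [0, 1)")
    d_up_sq = (1.0 - delta) * d0_sq - rho * v0_sq   # D_eps^2
    d_low_sq = (1.0 + delta) * d0_sq + rho * v0_sq  # D_{underline eps}^2
    xi_sq = score ** 2 / d0_sq
    xi_up_sq = score ** 2 / d_up_sq
    xi_low_sq = score ** 2 / d_low_sq
    alpha = 2.0 * tau / (1.0 - tau ** 2)
    return [
        d_up_sq >= (1.0 - tau) * d0_sq,
        d_low_sq <= (1.0 + tau) * d0_sq,
        xi_up_sq - xi_sq <= tau / (1.0 - tau) * xi_sq,
        xi_sq - xi_low_sq <= tau / (1.0 + tau) * xi_sq,
        xi_up_sq - xi_low_sq <= alpha * xi_sq,
    ]


checks = lemma_3_9_bounds(d0_sq=4.0, v0_sq=3.0, a_sq=1.0, delta=0.1, rho=0.2, score=2.0)
```

The last inequality is simply the sum of the two preceding ones, since $\tau/(1-\tau) + \tau/(1+\tau) = 2\tau/(1-\tau^2)$.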

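Our final step is in showing that under $(ED_0)$, the norm $\|\xi\|$ behaves essentially as a
norm of a Gaussian vector with the same covariance matrix. Define for $B \stackrel{\rm def}{=} D_0^{-1} V_0^2 D_0^{-1}$

$$ p_0 \stackrel{\rm def}{=} \mathrm{tr}(B), \qquad v_0^2 \stackrel{\rm def}{=} 2 \, \mathrm{tr}(B^2), \qquad \lambda_0 \stackrel{\rm def}{=} \|B\|_\infty = \lambda_{\max}(B). $$

Under the identifiability condition $(\mathcal{I})$, one can bound

$$ B \le a^2 I_p, \qquad p_0 \le a^2 p, \qquad v_0^2 \le 2 a^4 p, \qquad \lambda_0 \le a^2. $$

Similarly to the previous result, we assume that the constant $g$ from condition $(ED_0)$
is sufficiently large, namely $g^2 \ge 2 p_0$. Define $\mu_c = 2/3$ and

$$ y_c^2 \stackrel{\rm def}{=} g^2/\mu_c^2 - p_0/\mu_c, $$

$$ g_c \stackrel{\rm def}{=} \mu_c y_c = \sqrt{g^2 - \mu_c p_0}, $$

$$ 2 x_c \stackrel{\rm def}{=} \mu_c y_c^2 + \log\det(I_p - \mu_c B/\lambda_0). $$

It is easy to see that $y_c^2 \ge 3g^2/2$ and $g_c \ge \sqrt{2/3} \, g$.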
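
The constants above are straightforward to compute once the spectrum of $B$ is known. The following sketch assumes that $B$ is represented by its (positive) eigenvalues and uses the definitions $p_0 = \mathrm{tr}(B)$, $v_0^2 = 2\,\mathrm{tr}(B^2)$, $\lambda_0 = \lambda_{\max}(B)$, $\mu_c = 2/3$, $y_c^2 = g^2/\mu_c^2 - p_0/\mu_c$, $g_c = \sqrt{g^2 - \mu_c p_0}$ and $2x_c = \mu_c y_c^2 + \log\det(I_p - \mu_c B/\lambda_0)$; the function name is hypothetical.

```python
import math


def gaussian_type_constants(eigs_b, g):
    """Constants p0, v0, lambda0, y_c^2, g_c, x_c for B given by its eigenvalues."""
    p0 = sum(eigs_b)                                   # tr(B)
    v0 = math.sqrt(2.0 * sum(e * e for e in eigs_b))   # v0^2 = 2 tr(B^2)
    lam0 = max(eigs_b)                                 # largest eigenvalue of B
    if g * g < 2.0 * p0:
        raise ValueError("the construction assumes g^2 >= 2 * p0")
    mu_c = 2.0 / 3.0
    yc_sq = g * g / mu_c ** 2 - p0 / mu_c
    gc = math.sqrt(g * g - mu_c * p0)
    # eigenvalues of I - mu_c * B / lam0 lie in [1/3, 1), so the logarithm is defined
    log_det = sum(math.log(1.0 - mu_c * e / lam0) for e in eigs_b)
    xc = 0.5 * (mu_c * yc_sq + log_det)
    return {"p0": p0, "v0": v0, "lam0": lam0, "yc_sq": yc_sq, "gc": gc, "xc": xc}


consts = gaussian_type_constants([1.0, 0.5, 0.25], g=10.0)
```

For this toy spectrum the two inequalities $y_c^2 \ge 3g^2/2$ and $g_c \ge \sqrt{2/3}\,g$ can be confirmed directly from the returned values.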
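Theorem 3.10. Let $(ED_0)$ hold with $\nu_0 = 1$ and $g^2 \ge 2 p_0$. Then $\mathbb{E}\|\xi\|^2 \le p_0$, and
for each $x \le x_c$

$$ \mathbb{P}\bigl( \|\xi\|^2/\lambda_0 \ge \mathfrak{z}(x,B) \bigr) \le 2 e^{-x} + 8.4 e^{-x_c}, \qquad (3.14) $$

where $\mathfrak{z}(x,B)$ is defined by

$$ \mathfrak{z}(x,B) \stackrel{\rm def}{=} \begin{cases} p_0 + 2 v_0 x^{1/2}, & x \le v_0/18, \\ p_0 + 6x, & v_0/18 < x \le x_c. \end{cases} $$

Moreover, for $x > x_c$, it holds with $\mathfrak{z}(x,B) = |y_c + 2(x - x_c)/g_c|^2$

$$ \mathbb{P}\bigl( \|\xi\|^2/\lambda_0 \ge \mathfrak{z}(x,B) \bigr) \le 8.4 e^{-x}. $$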

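Proof. It follows from condition $(ED_0)$ that

$$ \mathbb{E}\|\xi\|^2 = \mathbb{E} \, \mathrm{tr}(\xi \xi^\top) = \mathrm{tr} \, D_0^{-1} \bigl[ \mathbb{E} \nabla L(\theta^*) \{\nabla L(\theta^*)\}^\top \bigr] D_0^{-1} = \mathrm{tr} \bigl[ D_0^{-2} \mathrm{Var}\{\nabla L(\theta^*)\} \bigr] $$

and $(ED_0)$ implies $\gamma^\top \mathrm{Var}\{\nabla L(\theta^*)\} \gamma \le \gamma^\top V_0^2 \gamma$ and thus, $\mathbb{E}\|\xi\|^2 \le p_0$. The deviation
bound (3.14) is proved in Corollary A.12. □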

Remark 3.3. This small remark concerns the term $8.4 e^{-x_c}$ in the probability bound
(3.14). As already mentioned, this bound implicitly assumes that the constant $g$ is large
(usually $g \asymp n^{1/2}$). Then $x_c \asymp g^2 \asymp n$ is large as well. So, $e^{-x_c}$ is very small and
asymptotically negligible. Below we often ignore this term. For $x \le x_c$, we can use
$\mathfrak{z}(x,B) = p_0 + 6x$.

Remark 3.4. The exponential bound of Theorem 3.10 helps to describe the critical value
of $\mathfrak{z}$ ensuring a prescribed deviation probability $\mathbb{P}(\|\xi\|^2 \ge \mathfrak{z})$. Namely, this probability
starts to gradually decrease when $\mathfrak{z}$ grows over $\lambda_0 p_0$. In particular, this helps to answer
a very important question about a proper choice of the critical value $\mathfrak{z}$ providing the
prescribed covering probability, or of the value $z$ ensuring the dominating concentration
probability $\mathbb{P}(\|D_\epsilon(\tilde\theta - \theta^*)\| \le z)$.

The definition of the set $C_\epsilon(r)$ from (3.8) involves the event $\{\|V_0 D_{\underline\epsilon}^{-1} \xi_{\underline\epsilon}\| > r\}$. Under
$(\mathcal{I})$, it is included in the set $\{\|\xi_\epsilon\| \ge (1 + \alpha_\epsilon)^{-1/2} a^{-1} r\}$, see (3.13), and its probability
is of order $e^{-x}$ for $r^2 \ge C(x + p)$ with a fixed $C > 0$.

By Proposition 3.7, one can use $\max\{\diamondsuit_\epsilon(r), \diamondsuit_{\underline\epsilon}(r)\} \le \varrho \, \mathfrak{z}_0(x,Q)$ on a set of probability
at least $1 - 2e^{-x}$. Further, $\|\xi\|^2/\lambda_0 \le \mathfrak{z}(x,B)$ with a probability of order $1 - 2e^{-x}$;
see (3.14). Putting together the obtained bounds yields for the spread $\Delta_\epsilon(r)$ with a
probability about $1 - 4e^{-x}$

$$ \Delta_\epsilon(r) \le 2 \varrho \, \mathfrak{z}_0(x,Q) + \alpha_\epsilon \lambda_0 \, \mathfrak{z}(x,B). $$

The results obtained in Section 3.2 are sharp and meaningful if the spread $\Delta_\epsilon(r)$ is
smaller in order than the value $\|\xi\|^2$. Theorem 3.10 states that $\|\xi\|^2$ does not significantly
deviate over its expected value $p_0 = \mathbb{E}\|\xi\|^2$ which is our leading term. We know
that $\mathfrak{z}_0(x,Q) \approx Q + x = c_1 p + x$ if $x$ is not too large. Also $\mathfrak{z}(x,B) \le p_0 + 6x$, where $p_0$
is of order $p$ due to $(\mathcal{I})$. Summarizing the above discussion yields that the local results
apply if the regularity condition $(\mathcal{I})$ holds and the values $\varrho$ and $\alpha_\epsilon$, or equivalently,
$\omega(r), \delta(r)$ are small. In Section 5 we show for the i.i.d. example that $\omega(r) \asymp \sqrt{r^2/n}$
and similarly for $\delta(r)$.

3.4 Proof of Corollaries 3.3 and 3.4 

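The bound (3.4) together with Lemma 3.8 yield on $C_\epsilon(r)$

$$ L(\tilde\theta,\theta^*) = \sup_{\theta \in \Theta_0(r)} L(\theta,\theta^*) \ge \sup_{\theta \in \Theta_0(r)} \mathbb{L}_{\underline\epsilon}(\theta,\theta^*) - \diamondsuit_{\underline\epsilon}(r) = \|\xi_{\underline\epsilon}\|^2/2 - \diamondsuit_{\underline\epsilon}(r). \qquad (3.15) $$

Similarly

$$ L(\tilde\theta,\theta^*) \le \sup_{\theta \in \Theta_0(r)} \mathbb{L}_\epsilon(\theta,\theta^*) + \diamondsuit_\epsilon(r) \le \|\xi_\epsilon\|^2/2 + \diamondsuit_\epsilon(r) $$

yielding (3.9). For getting (3.10), we again apply the inequality $L(\theta,\theta^*) \le \mathbb{L}_\epsilon(\theta,\theta^*) + \diamondsuit_\epsilon(r)$
from Theorem 3.1 for $\theta$ equal to $\tilde\theta$. With $\xi_\epsilon = D_\epsilon^{-1} \nabla L(\theta^*)$ and $u_\epsilon \stackrel{\rm def}{=} D_\epsilon(\tilde\theta - \theta^*)$,
this gives

$$ L(\tilde\theta,\theta^*) - \xi_\epsilon^\top u_\epsilon + \|u_\epsilon\|^2/2 \le \diamondsuit_\epsilon(r). $$

Therefore, by (3.15)

$$ \|\xi_{\underline\epsilon}\|^2/2 - \diamondsuit_{\underline\epsilon}(r) - \xi_\epsilon^\top u_\epsilon + \|u_\epsilon\|^2/2 \le \diamondsuit_\epsilon(r) $$

or, equivalently

$$ \|\xi_\epsilon\|^2/2 - \xi_\epsilon^\top u_\epsilon + \|u_\epsilon\|^2/2 \le \diamondsuit_\epsilon(r) + \diamondsuit_{\underline\epsilon}(r) + \bigl( \|\xi_\epsilon\|^2 - \|\xi_{\underline\epsilon}\|^2 \bigr)/2 $$

and the definition of $\Delta_\epsilon(r)$ implies $\|u_\epsilon - \xi_\epsilon\|^2 \le 2\Delta_\epsilon(r)$.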
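
The final implication rests on completing the square. Written out, with $u_\epsilon = D_\epsilon(\tilde\theta - \theta^*)$ and the spread taken as $\Delta_\epsilon(r) = \diamondsuit_\epsilon(r) + \diamondsuit_{\underline\epsilon}(r) + (\|\xi_\epsilon\|^2 - \|\xi_{\underline\epsilon}\|^2)/2$ (the form assumed here from Section 3.3), the step reads:

```latex
\begin{align*}
\tfrac12 \|u_\epsilon - \xi_\epsilon\|^2
  &= \tfrac12 \|\xi_\epsilon\|^2 - \xi_\epsilon^\top u_\epsilon + \tfrac12 \|u_\epsilon\|^2 \\
  &\le \diamondsuit_\epsilon(r) + \diamondsuit_{\underline\epsilon}(r)
     + \tfrac12 \bigl( \|\xi_\epsilon\|^2 - \|\xi_{\underline\epsilon}\|^2 \bigr)
   = \Delta_\epsilon(r),
\end{align*}
% multiplying by 2 gives \|u_\epsilon - \xi_\epsilon\|^2 \le 2 \Delta_\epsilon(r).
```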
4 Upper function approach and concentration of the qMLE

A very important step in the analysis of the qMLE $\tilde\theta$ is localization. This property means
that $\tilde\theta$ concentrates in a small vicinity of the central point $\theta^*$. This section states such
a concentration bound under the global conditions of Section 2. Given $r_0$, the deviation
bound describes the probability $\mathbb{P}(\tilde\theta \notin \Theta_0(r_0))$ that $\tilde\theta$ does not belong to the local
vicinity $\Theta_0(r_0)$ of $\theta^*$. The question of interest is whether $r_0$ can be selected
in such a way that the local bracketing result and the deviation bound apply simultaneously;
see the discussion at the end of the section.

Below we suppose that a sufficiently large constant $x$ is fixed to specify the accepted
level of order $e^{-x}$ for this deviation probability. All the constructions below depend
upon this constant. We do not indicate it explicitly for ease of notation.

The key step in this large deviation bound is made in terms of an upper function for
the process $L(\theta,\theta^*) = L(\theta) - L(\theta^*)$. Namely, $u(\theta)$ is a deterministic upper function if
it holds with a high probability:

$$ \sup_{\theta \in \Theta} \{ L(\theta,\theta^*) + u(\theta) \} \le 0. \qquad (4.1) $$

Such bounds are usually called for in the analysis of the posterior measure in the Bayes
approach. Below we present sufficient conditions ensuring (4.1). Now we explain how
the concentration bounds can be derived from (4.1). Let $u(\theta)$ be an upper function. It
can be used for describing the concentration sets for $\tilde\theta$.

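Lemma 4.1. Let $u(\theta)$ be an upper function in the sense

$$ \mathbb{P}\Bigl( \sup_{\theta \in \Theta} \{ L(\theta,\theta^*) + u(\theta) \} \ge 0 \Bigr) \le e^{-x} \qquad (4.2) $$

for $x > 0$. Given a subset $\Theta_0 \subset \Theta$ with $\theta^* \in \Theta_0$, the condition $u(\theta) \ge 0$ for $\theta \notin \Theta_0$
ensures

$$ \mathbb{P}\bigl( \tilde\theta \notin \Theta_0 \bigr) \le e^{-x}. $$

Proof. If $\Theta^\circ$ is a subset of $\Theta$ not containing $\theta^*$, then the event $\tilde\theta \in \Theta^\circ$ is only possible
if $\sup_{\theta \in \Theta^\circ} L(\theta,\theta^*) \ge 0$, because $L(\theta^*,\theta^*) = 0$. This yields the result. □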

A possible way of checking the condition (4.2) is based on a lower quadratic bound for
the negative expectation $-\mathbb{E} L(\theta,\theta^*) \ge b(r) \|V_0(\theta - \theta^*)\|^2/2$ in the sense of condition
$(\mathcal{L}r)$ from Section 2.2. We present two different results. The first one assumes that the
values $b(r)$ can be fixed universally for all $r \ge r_0$.

Theorem 4.2. Suppose $(Er)$ and $(\mathcal{L}r)$ with $b(r) \equiv b$. Let, for $r \ge r_0$,

$$ 1 + \sqrt{x + Q} \le 3 \nu_0^2 g(r)/b, \qquad (4.3) $$

$$ 6 \nu_0 \sqrt{x + Q} \le r b, \qquad (4.4) $$

with $x + Q \ge 2.5$ and $Q = c_1 p$. Then

$$ \mathbb{P}\bigl( \tilde\theta \notin \Theta_0(r_0) \bigr) \le e^{-x}. \qquad (4.5) $$

Proof. The result follows from Theorem B.8 with /i = , t(/i) = 0, 11(0) = -L(0) — 
iEL(6>) and M{e ,6*) = -]EL{e ,6*) >\\\Vo{e - e*)f . □ 

Remark 4.1. The bound (4.5) requires only two conditions. Condition (4.3) means that
the value $g(r)$ from condition $(Er)$ fulfills $g^2(r) \ge C(x + p)$, that is, we need a qualified
rate in the exponential moment conditions. This is similar to requiring finite polynomial
moments for the score function. Condition (4.4) requires that $r$ exceeds some fixed
value, namely, $r^2 \ge C(x + p)$. This bound is helpful for fixing the value $r_0$ providing a
sensible deviation probability bound.

If $b(r)$ decreases with $r$, the result is a bit more involved. The key requirement
is that $b(r)$ decreases not too fast, so that the product $r \, b(r)$ grows to infinity with
$r$. The idea is to include the complement of the central set $\Theta_0$ in $\Theta$ in the union of
the growing sets $\Theta_0(r_k)$ with $b(r_k) \ge b(r_0) 2^{-k}$, and then apply Theorem 4.2 for each
$\Theta_0(r_k)$.

Theorem 4.3. Suppose $(Er)$ and $(\mathcal{L}r)$. Let $r_k$ be such that $b(r_k) \ge b(r_0) 2^{-k}$ for
$k \ge 1$. If the conditions

$$ 1 + \sqrt{x + Q + ck} \le 3 \nu_0^2 g(r_k)/b(r_k), $$

$$ 6 \nu_0 \sqrt{x + Q + ck} \le r_k \, b(r_k), $$

are fulfilled for $c = \log(2)$, then it holds

$$ \mathbb{P}\bigl( \tilde\theta \notin \Theta_0(r_0) \bigr) \le e^{-x}. $$

Proof. The result (4.5) is applied to each set $\Theta_0(r_k)$ and $x_k = x + ck$. This yields

$$ \mathbb{P}\bigl( \tilde\theta \notin \Theta_0(r_0) \bigr) \le \sum_{k \ge 1} \mathbb{P}\bigl( \tilde\theta \notin \Theta_0(r_k) \bigr) \le \sum_{k \ge 1} e^{-x - ck} = e^{-x} $$

as required. □
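
The choice $c = \log(2)$ is what makes the union bound close exactly: the levels $x_k = x + ck$ give probabilities $e^{-x} 2^{-k}$, which sum to $e^{-x}$ over $k \ge 1$. A short numerical sketch of this summation (the function name and the truncation at 60 terms are choices made for the illustration only):

```python
import math


def union_bound_total(x, terms=60):
    """Sum of exp(-(x + c*k)) over k = 1..terms with c = log 2."""
    c = math.log(2.0)
    return sum(math.exp(-(x + c * k)) for k in range(1, terms + 1))


total = union_bound_total(3.0)  # agrees with exp(-3) up to a 2**-60 relative error
```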
Remark 4.2. Here we briefly discuss the very important question: how can one fix the
value $r_0$ ensuring the bracketing result in the local set $\Theta_0(r_0)$ and a small probability of
the related set $C_\epsilon(r)$ from (3.8)? The event $\{\|V_0 D_{\underline\epsilon}^{-1} \xi_{\underline\epsilon}\| > r\}$ requires $r^2 \ge C(x + p)$.
Further we inspect the deviation bound for the complement $\Theta \setminus \Theta_0(r_0)$. For simplicity,
assume $(\mathcal{L}r)$ with $b(r) \equiv b$. Then the most important condition (4.4) of Theorem 4.2
requires that $r^2 \ge C b^{-2} (x + p)$. In words, the squared radius $r^2$ should be of order
$p$. The other condition (4.3) of Theorem 4.2 is technical and only requires that $g(r)$
is sufficiently large while the local results only require that $\delta(r)$ and $\varrho(r)$ are small
for such $r$. In the asymptotic setup one can typically bring these conditions together.
Section 5 provides further discussion for the i.i.d. setup.



5 Examples 

The model with independent identically distributed (i.i.d.) observations is one of the 
most popular setups in statistical literature and in statistical applications. The essential 
and the most developed part of the statistical theory is designed for the i.i.d. modeling.
Especially, the classical asymptotic parametric theory is almost complete including
asymptotic root-n normality and efficiency of the MLE and Bayes estimators under rather
mild assumptions; see e.g. Chapters 2 and 3 in Ibragimov and Khas'minskij (1981). So,
the i.i.d. model can naturally serve as a benchmark for any extension of the statistical 
theory: being applied to the i.i.d. setup, the new approach should lead to essentially 
the same conclusions as in the classical theory. Similar reasons apply to the regression 
model and its extensions. Below we try to demonstrate that the proposed non-asymptotic
viewpoint is able to reproduce the existing brilliant and well established results of the 
classical parametric theory. Surprisingly, the majority of classical efficiency results can 
be easily derived from the obtained general non-asymptotic bounds. 

The next question is whether there is any added value or benefits of the new approach 
being restricted to the i.i.d. situation relative to the classical one. Two important issues 
have been already mentioned: the new approach applies to the situation with finite 
samples and survives under model misspecification. One more important question is 
whether the obtained results remain applicable and informative if the dimension of the 
parameter space is high; this is one of the main challenges in modern statistics. We
show that the dimensionality p naturally appears in the risk bounds and the results 
apply as long as the sample size exceeds in order this value p . All these questions are 
addressed in Section 5.1 for the i.i.d. setup. Section 5.2 focuses on generalized linear 
modeling, while Section 5.3 discusses linear median regression. 
5.1 Quasi MLE in an i.i.d. model

The basic i.i.d. parametric model means that the observations $Y = (Y_1, \ldots, Y_n)$ are
independent identically distributed from a distribution $P$ which belongs to a given
parametric family $(P_\theta, \theta \in \Theta)$ on the observation space $\mathcal{Y}_1$. Each $\theta \in \Theta$ clearly yields
the product data distribution $\mathbb{P}_\theta = P_\theta^{\otimes n}$ on the product space $\mathcal{Y} = \mathcal{Y}_1^n$. This section
illustrates how the obtained general results can be applied to this type of modeling under
possible model misspecification. Different types of misspecification can be considered.
Each of the assumptions, namely, data independence, identical distribution, and parametric
form of the marginal distribution, can be violated. To be specific, we assume the
observations $Y_i$ independent and identically distributed. However, we admit that the
distribution of each $Y_i$ does not necessarily belong to the parametric family $(P_\theta)$. The
case of non-identically distributed observations can be treated similarly at the cost of more
complicated notation.

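In what follows the parametric family $(P_\theta)$ is supposed to be dominated by a measure
$\mu_0$, and each density $p(y,\theta) = dP_\theta/d\mu_0(y)$ is two times continuously differentiable in
$\theta$ for all $y$. Denote $\ell(y,\theta) = \log p(y,\theta)$. The parametric assumption $Y_i \sim P_{\theta^*} \in (P_\theta)$
leads to the log-likelihood

$$ L(\theta) = \sum \ell(Y_i,\theta), $$

where the summation is taken over $i = 1, \ldots, n$. The quasi MLE $\tilde\theta$ maximizes this sum
over $\theta \in \Theta$:

$$ \tilde\theta \stackrel{\rm def}{=} \mathop{\rm argmax}_{\theta \in \Theta} L(\theta) = \mathop{\rm argmax}_{\theta \in \Theta} \sum \ell(Y_i,\theta). $$

The target of estimation $\theta^*$ maximizes the expectation of $L(\theta)$:

$$ \theta^* \stackrel{\rm def}{=} \mathop{\rm argmax}_{\theta \in \Theta} \mathbb{E} L(\theta) = \mathop{\rm argmax}_{\theta \in \Theta} \mathbb{E} \ell(Y_1,\theta). $$

Let $\zeta_i(\theta) \stackrel{\rm def}{=} \ell(Y_i,\theta) - \mathbb{E}\ell(Y_i,\theta)$. Then $\zeta(\theta) = \sum \zeta_i(\theta)$. The equation $\nabla \mathbb{E} L(\theta^*) = 0$
implies

$$ \nabla L(\theta^*) = \nabla \zeta(\theta^*) = \sum \nabla \zeta_i(\theta^*). \qquad (5.1) $$

The i.i.d. structure of the $Y_i$'s allows us to rewrite the local conditions $(Er)$, $(ED_0)$,
$(ED_1)$, $(\mathcal{L}_0)$, and $(\mathcal{I})$ in terms of the marginal distribution.

$(ed_0)$ There exists a positively definite symmetric matrix $v_0$, such that for all $|\lambda| \le g_1$

$$ \sup_{\gamma \in \mathbb{R}^p} \log \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top \nabla \zeta_1(\theta^*)}{\|v_0 \gamma\|} \Bigr\} \le \nu_0^2 \lambda^2/2. $$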
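A natural candidate on $v_0^2$ is given by the variance of the gradient $\nabla \ell(Y_1,\theta^*)$, that is,

$$ v_0^2 = \mathrm{Var}\{\nabla \ell(Y_1,\theta^*)\} = \mathrm{Var}\{\nabla \zeta_1(\theta^*)\}. $$

Next consider the local sets

$$ \Theta_{\rm loc}(u) = \{\theta : \|v_0(\theta - \theta^*)\| \le u\}. $$

In view of $V_0^2 = n v_0^2$, it holds $\Theta_0(r) = \Theta_{\rm loc}(u)$ with $r^2 = n u^2$.

Below we distinguish between local conditions for $u \le u_0$ and the global conditions
for all $u > 0$, where $u_0$ is some fixed value.

The local smoothness conditions $(ED_1)$ and $(\mathcal{L}_0)$ require us to specify the functions
$\delta(r)$ and $\omega(r)$ for $r \le r_0$ where $r_0^2 = n u_0^2$. If the log-likelihood function $\ell(y,\theta)$ is
sufficiently smooth in $\theta$, these functions can be selected proportional to $u = r/n^{1/2}$.

$(ed_1)$ There are constants $\omega^* > 0$ and $g_1 > 0$ such that for each $u \le u_0$ and $|\lambda| \le g_1$

$$ \sup_{\gamma \in \mathbb{R}^p} \sup_{\theta \in \Theta_{\rm loc}(u)} \log \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top [\nabla \zeta_1(\theta) - \nabla \zeta_1(\theta^*)]}{\omega^* u \, \|v_0 \gamma\|} \Bigr\} \le \nu_0^2 \lambda^2/2. $$

Further we restate the local identifiability condition $(\mathcal{L}_0)$ in terms of the expected
value $k(\theta,\theta^*) \stackrel{\rm def}{=} -\mathbb{E}\{\ell(Y_i,\theta) - \ell(Y_i,\theta^*)\}$ for each $i$. We suppose that $k(\theta,\theta^*)$ is two
times differentiable w.r.t. $\theta$. The definition of $\theta^*$ implies $\nabla \mathbb{E}\ell(Y_i,\theta^*) = 0$. Define
also the matrix $f_0 = -\nabla^2 \mathbb{E}\ell(Y_i,\theta^*)$. In the parametric case $P = P_{\theta^*}$, $k(\theta,\theta^*)$ is the
Kullback-Leibler divergence between $P_{\theta^*}$ and $P_\theta$ while the matrices $v_0^2 = f_0$ are equal
to each other and coincide with the Fisher information matrix of the family $(P_\theta)$ at $\theta^*$.

$(\ell_0)$ There is a constant $\delta^*$ such that it holds for each $u \le u_0$

$$ \sup_{\theta \in \Theta_{\rm loc}(u)} \Bigl| \frac{2 k(\theta,\theta^*)}{(\theta - \theta^*)^\top f_0 \, (\theta - \theta^*)} - 1 \Bigr| \le \delta^* u. $$

$(\iota)$ There is a constant $a > 0$ such that $a^2 f_0 \ge v_0^2$.

$(eu)$ For each $u > 0$, there exists $g_1(u) > 0$, such that for all $|\lambda| \le g_1(u)$

$$ \sup_{\gamma \in \mathbb{R}^p} \sup_{\theta \in \Theta_{\rm loc}(u)} \log \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top \nabla \zeta_1(\theta)}{\|v_0 \gamma\|} \Bigr\} \le \nu_0^2 \lambda^2/2. $$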




(£u) For each u > , there exists b(u) > such that 



sup 
ee0: ||vo{0-0*) 



k{9,9*) 
Lemma 5.1. Let $Y_1, \ldots, Y_n$ be i.i.d. Then $(eu)$, $(ed_0)$, $(ed_1)$, $(\iota)$, and $(\ell_0)$ imply
$(Er)$, $(ED_0)$, $(ED_1)$, $(\mathcal{I})$, and $(\mathcal{L}_0)$ with $V_0^2 = n v_0^2$, $D_0^2 = n f_0$, $\omega(r) = \omega^* r/n^{1/2}$,
$\delta(r) = \delta^* r/n^{1/2}$, and $g = g_1 \sqrt{n}$.

Proof. The identities $V_0^2 = n v_0^2$, $D_0^2 = n f_0$ follow from the i.i.d. structure of the
observations $Y_i$. We briefly comment on condition $(Er)$. Using once again the i.i.d.
structure yields by (5.1) in view of $V_0^2 = n v_0^2$

$$ \log \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top \nabla \zeta(\theta)}{\|V_0 \gamma\|} \Bigr\} = n \log \mathbb{E} \exp\Bigl\{ \frac{\lambda}{n^{1/2}} \frac{\gamma^\top \nabla \zeta_1(\theta)}{\|v_0 \gamma\|} \Bigr\} \le \nu_0^2 \lambda^2/2 $$

as long as $\lambda \le n^{1/2} g_1(u) \le g(r)$. Similarly for $(ED_0)$ and $(ED_1)$. □

Remark 5.1. This remark discusses how the presented conditions relate to what is
usually assumed in statistical literature. One general remark concerns the choice of the
parametric family $(P_\theta)$. The point of the classical theory is that the true measure is
in this family, so the conditions should be as weak as possible. The viewpoint of this
paper is slightly different: whatever family $(P_\theta)$ is taken, the true measure is never
included, any model is only an approximation of reality. On the other hand, the choice
of the parametric model $(P_\theta)$ is always made by a statistician. Sometimes special
stylized features of the model force one to include an irregularity in this family. Otherwise
any smoothness condition on the log-density $\ell(y,\theta)$ can be secured by a proper choice of
the family $(P_\theta)$.

The presented list also includes the exponential moment conditions $(ed_0)$ and $(ed_1)$
on the gradient $\nabla \ell(Y_1,\theta)$. We need exponential moments for establishing some
nonasymptotic risk bounds; the classical concentration bounds require even stronger
conditions, namely that the considered random variables are bounded.

The identifiability condition $(\ell u)$ is very easy to check in the usual asymptotic setup.
Indeed, if the parameter set $\Theta$ is compact and the Kullback-Leibler divergence $k(\theta,\theta^*)$
is continuous and positive for all $\theta \ne \theta^*$, then $(\ell u)$ is fulfilled automatically with a
universal constant $b$. If $\Theta$ is not compact, the condition is still fulfilled but the function
$b(u)$ may depend on $u$.

Below we specify the general results of Sections 3 and 4 to the i.i.d. setup.

5.1.1 A large deviation bound

This section presents some sufficient conditions ensuring a small deviation probability
for the event $\{\tilde\theta \notin \Theta_{\rm loc}(u_0)\}$ for a fixed $u_0$. Below $Q = c_1 p$. We only discuss the case
$b(u) \equiv b$. The general case only requires more complicated notation. The next result
follows from Theorem 4.2 with the obvious changes.

Theorem 5.2. Suppose $(eu)$ and $(\ell u)$ with $b(u) \equiv b$. If, for $u_0 > 0$,

$$ n^{1/2} u_0 b \ge 6 \nu_0 \sqrt{x + Q}, \qquad (5.2) $$

$$ 1 + \sqrt{x + Q} \le 3 b^{-1} \nu_0^2 g_1(u_0) n^{1/2}, $$

then

$$ \mathbb{P}\bigl( \tilde\theta \notin \Theta_{\rm loc}(u_0) \bigr) \le e^{-x}. $$

Remark 5.2. The presented result helps to qualify two important values $u_0$ and $n$
providing a sensible deviation probability bound. For simplicity suppose that $g_1(u) \equiv
g_1 > 0$. Then the condition (5.2) can be written as $n u_0^2 \gtrsim x + Q$. In other words, the
result of the theorem claims a large deviation bound for the vicinity $\Theta_{\rm loc}(u_0)$ with $u_0^2$
of order $p/n$. In classical asymptotic statistics this result is usually referred to as root-n
consistency. Our approach yields this result in a very strong form and for finite samples.
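
To see the orders of magnitude involved, the radius condition can be solved for $u_0$. The sketch below assumes that (5.2) reads $n^{1/2} u_0 b \ge 6 \nu_0 \sqrt{x + Q}$ with $Q = c_1 p$ ($c_1 = 2$ for $p \ge 2$, $c_1 = 2.7$ for $p = 1$); the function name and the numerical inputs are hypothetical.

```python
import math


def min_local_radius(n, p, x, nu0, b):
    """Smallest u0 satisfying n^{1/2} * u0 * b >= 6 * nu0 * sqrt(x + Q), Q = c1 * p."""
    c1 = 2.7 if p == 1 else 2.0
    return 6.0 * nu0 * math.sqrt(x + c1 * p) / (b * math.sqrt(n))


# Illustrative values: n = 10000 observations, p = 10 parameters, level x = 3.
u0 = min_local_radius(n=10_000, p=10, x=3.0, nu0=1.0, b=1.0)
u0_sq_times_n = u0 ** 2 * 10_000  # equals 36 * (x + Q) for nu0 = b = 1
```

The product $n u_0^2$ depends on $p$ and $x$ only through $x + Q$, which is the finite-sample form of the statement that $u_0^2$ is of order $p/n$.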

5.1.2 Local inference 

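Now we restate the general local bounds of Section 3 for the i.i.d. case. First we describe
the approximating linear models. The matrices $v_0^2$ and $f_0$ from conditions $(ed_0)$, $(ed_1)$,
and $(\ell_0)$ determine their drift and variance components. Define

$$ f_\epsilon \stackrel{\rm def}{=} (1 - \delta) f_0 - \varrho \, v_0^2. $$

If $\tau_\epsilon \stackrel{\rm def}{=} \delta + a^2 \varrho < 1$, then

$$ f_\epsilon \ge (1 - \tau_\epsilon) f_0 > 0. $$

Further, $D_\epsilon^2 = n f_\epsilon$ and

$$ \xi_\epsilon = D_\epsilon^{-1} \nabla \zeta(\theta^*) = (n f_\epsilon)^{-1/2} \sum \nabla \ell(Y_i,\theta^*). $$

The upper bracketing process reads as

$$ \mathbb{L}_\epsilon(\theta,\theta^*) = (\theta - \theta^*)^\top D_\epsilon \xi_\epsilon - \|D_\epsilon(\theta - \theta^*)\|^2/2. $$

This expression can be viewed as log-likelihood for the linear model $\xi_\epsilon = D_\epsilon \theta + \varepsilon$ for a
standard normal error $\varepsilon$. The (quasi) MLE $\tilde\theta_\epsilon$ for this model is of the form $\tilde\theta_\epsilon = D_\epsilon^{-1} \xi_\epsilon$.

Theorem 5.3. Suppose $(ed_0)$. Given $u_0$, assume $(ed_1)$, $(\ell_0)$, and $(\iota)$ on $\Theta_{\rm loc}(u_0)$,
and let $\varrho = 3 \nu_0 \omega^* u_0$, $\delta = \delta^* u_0$, and $\tau_\epsilon \stackrel{\rm def}{=} \delta + a^2 \varrho < 1$. Then the results of Theorem 3.1
and all its corollaries apply to the case of i.i.d. modeling with $r_0^2 = n u_0^2$. In particular,
on the random set $C_\epsilon(r_0) = \{\tilde\theta \in \Theta_{\rm loc}(u_0), \ \|\xi_\epsilon\| \le r_0\}$, it holds

$$ \|\xi_{\underline\epsilon}\|^2/2 - \diamondsuit_{\underline\epsilon}(r_0) \le L(\tilde\theta,\theta^*) \le \|\xi_\epsilon\|^2/2 + \diamondsuit_\epsilon(r_0), $$

$$ \bigl\| \sqrt{n f_\epsilon} \, (\tilde\theta - \theta^*) - \xi_\epsilon \bigr\|^2 \le 2 \Delta_\epsilon(r_0). $$

The random quantities $\diamondsuit_\epsilon(r_0)$ and $\Delta_\epsilon(r_0)$ follow the probability bounds of Proposition 3.7
and Theorem 3.10.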

Now we briefly discuss the implications of Theorems 5.2 and 5.3 to the classical
asymptotic setup with $n \to \infty$. We fix $u_0^2 = Cp/n$ for a constant $C$ ensuring the deviation
bound of Theorem 5.2. Then $\delta$ is of order $u_0$ and the same for $\varrho$. For a sufficiently
large $n$, both quantities are small and thus, the spread $\Delta_\epsilon(r_0)$ is small as well; see
Section 3.3.

Further, under condition $(ed_0)$, the normalized score

$$ \xi \stackrel{\rm def}{=} (n f_0)^{-1/2} \sum \nabla \ell(Y_i,\theta^*) $$

is zero mean asymptotically normal by the central limit theorem. Moreover, if $f_0 = v_0^2$,
then $\xi$ is asymptotically standard normal. The same holds for $\xi_\epsilon$. This immediately
yields all classical asymptotic results like the Wilks theorem or the Fisher expansion for the MLE
in the i.i.d. setup as well as the asymptotic efficiency of the MLE. Moreover, our
bounds yield the asymptotic result for the case when the parameter dimension $p = p_n$
grows linearly with $n$. Below $u_n = o_n(p_n)$ means that $u_n/p_n \to 0$ as $n \to \infty$.

Theorem 5.4. Let $Y_1, \ldots, Y_n$ be i.i.d. $P_{\theta^*}$ and let $(ed_0)$, $(ed_1)$, $(\ell_0)$, $(\iota)$, $(eu)$, and
$(\ell u)$ with $b(u) \equiv b$ hold. If $n \ge C p_n$ for a fixed constant $C$ depending on constants in
the above conditions only, then

$$ \bigl\| \sqrt{n f_0} \, (\tilde\theta - \theta^*) - \xi \bigr\|^2 = o_n(p_n), \qquad 2 L(\tilde\theta,\theta^*) - \|\xi\|^2 = o_n(p_n). $$

This result particularly yields that $\sqrt{n f_0} \, (\tilde\theta - \theta^*)$ is nearly standard normal and
$2 L(\tilde\theta,\theta^*)$ is nearly $\chi^2_p$.

5.2 Generalized linear modeling 

Now we consider generalized linear modeling (GLM), which is often used for describing
some categorical data. Let $\mathcal{P} = (P_w, w \in \Upsilon)$ be an exponential family with a
canonical parametrization; see e.g. McCullagh and Nelder (1989). The corresponding
log-density can be represented as $\ell(y,w) = yw - d(w)$ for a convex function $d(w)$.
The popular examples are given by the binomial (binary response, logistic) model with
$d(w) = \log(e^w + 1)$, the Poisson model with $d(w) = e^w$, the exponential model
with $d(w) = -\log(w)$. Note that linear Gaussian regression is a special case with
$d(w) = w^2/2$.
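
In a canonical exponential family the derivative $d'(w)$ is the mean of the response under $P_w$, which is what links the function $d$ to the regression function. A minimal numerical sketch for three of the examples, assuming the standard forms $d(w) = \log(e^w + 1)$ (logistic), $d(w) = e^w$ (Poisson) and $d(w) = w^2/2$ (Gaussian); the dictionary and function names are hypothetical.

```python
import math

# Cumulant functions d(w) of three canonical exponential families.
CUMULANTS = {
    "logistic": lambda w: math.log(math.exp(w) + 1.0),
    "poisson": lambda w: math.exp(w),
    "gaussian": lambda w: 0.5 * w * w,
}


def mean_response(d, w, h=1e-6):
    """Central-difference approximation of d'(w), the mean of Y under P_w."""
    return (d(w + h) - d(w - h)) / (2.0 * h)


logistic_mean = mean_response(CUMULANTS["logistic"], 0.0)  # close to 1/2
poisson_mean = mean_response(CUMULANTS["poisson"], 1.0)    # close to e
gaussian_mean = mean_response(CUMULANTS["gaussian"], 2.0)  # close to 2
```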

A GLM specification means that every observation $Y_i$ has a distribution from the
family $\mathcal{P}$ with the parameter $w_i$ which linearly depends on the regressor $\Psi_i \in \mathbb{R}^p$:

$$ Y_i \sim P_{\Psi_i^\top \theta^*}. \qquad (5.3) $$

The corresponding log-density of a GLM reads as

$$ L(\theta) = \sum \bigl\{ Y_i \Psi_i^\top \theta - d(\Psi_i^\top \theta) \bigr\}. $$

Under $\mathbb{P}_{\theta^*}$ each observation $Y_i$ follows (5.3), in particular, $\mathbb{E} Y_i = d'(\Psi_i^\top \theta^*)$. However,
similarly to the previous sections, it is accepted that the parametric model (5.3)
is misspecified. Response misspecification means that the vector $f \stackrel{\rm def}{=} \mathbb{E} Y$ cannot be
represented in the form $d'(\Psi^\top \theta)$ whatever $\theta$ is. The other sort of misspecification
concerns the data distribution. The model (5.3) assumes that the $Y_i$'s are independent and
the marginal distribution belongs to the given parametric family $\mathcal{P}$. In what follows,
we only assume independent data having certain exponential moments. The target of
estimation $\theta^*$ is defined by

$$ \theta^* \stackrel{\rm def}{=} \mathop{\rm argmax}_{\theta} \mathbb{E} L(\theta). $$


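The quasi MLE $\tilde\theta$ is defined by maximization of $L(\theta)$:

$$ \tilde\theta = \mathop{\rm argmax}_{\theta} L(\theta) = \mathop{\rm argmax}_{\theta} \sum \bigl\{ Y_i \Psi_i^\top \theta - d(\Psi_i^\top \theta) \bigr\}. $$

Convexity of $d(\cdot)$ implies that $L(\theta)$ is a concave function of $\theta$, so that the optimization
problem has a unique solution and can be effectively solved. However, a closed form
solution is only available for the constant regression or for the linear Gaussian regression.
The corresponding target $\theta^*$ is the maximizer of the expected log-likelihood:

$$ \theta^* = \mathop{\rm argmax}_{\theta} \mathbb{E} L(\theta) = \mathop{\rm argmax}_{\theta} \sum \bigl\{ f_i \Psi_i^\top \theta - d(\Psi_i^\top \theta) \bigr\} $$

with $f_i = \mathbb{E} Y_i$. The function $\mathbb{E} L(\theta)$ is concave as well and the vector $\theta^*$ is also well
defined.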
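
Because $L(\theta)$ is concave, the quasi MLE can be computed by standard Newton iterations. The sketch below is one possible illustration, not the paper's procedure: it assumes a logistic GLM with a scalar parameter ($p = 1$), so that $d(w) = \log(1 + e^w)$, $d'(w) = 1/(1 + e^{-w})$ and $d''(w) = d'(w)\{1 - d'(w)\}$. The function name and the data are hypothetical.

```python
import math


def logistic_qmle(ys, psis, steps=25):
    """Newton iterations maximizing sum(y * psi * theta - d(psi * theta)) over a scalar theta."""
    theta = 0.0
    for _ in range(steps):
        grad = 0.0
        neg_hess = 0.0
        for y, psi in zip(ys, psis):
            mu = 1.0 / (1.0 + math.exp(-psi * theta))  # d'(psi * theta)
            grad += (y - mu) * psi                      # derivative of L
            neg_hess += mu * (1.0 - mu) * psi * psi     # minus second derivative of L
        theta += grad / neg_hess
    return theta


# Constant regression (all regressors equal to 1): the closed form is the logit of the mean.
responses = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
theta_hat = logistic_qmle(responses, [1.0] * len(responses))
```

In the constant-regression case the iterations reproduce the closed-form solution $\log\{\bar{Y}/(1 - \bar{Y})\}$; the responses need not be binary for the quasi likelihood to be well defined, only their mean must lie strictly between 0 and 1.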

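Define the individual errors (residuals) $\varepsilon_i = Y_i - \mathbb{E} Y_i$. Below we assume that these
errors fulfill some exponential moment conditions.

$(e_1)$ There exist some constants $\nu_0$ and $g_1 > 0$, and for every $i$ a constant $s_i$ such
that $\mathbb{E}(\varepsilon_i/s_i)^2 \le 1$ and

$$ \log \mathbb{E} \exp(\lambda \varepsilon_i/s_i) \le \nu_0^2 \lambda^2/2, \qquad |\lambda| \le g_1. \qquad (5.4) $$

A natural candidate for $s_i$ is $\sigma_i$ where $\sigma_i^2 = \mathbb{E} \varepsilon_i^2$ is the variance of $\varepsilon_i$; see
Lemma B.17. Under (5.4), introduce a $p \times p$ matrix $V_0$ defined by

$$ V_0^2 \stackrel{\rm def}{=} \sum s_i^2 \Psi_i \Psi_i^\top. \qquad (5.5) $$

Condition $(e_1)$ effectively means that each error term $\varepsilon_i = Y_i - \mathbb{E} Y_i$ has some bounded
exponential moments: for $|\lambda| \le g_1$, it holds $f(\lambda) \stackrel{\rm def}{=} \log \mathbb{E} \exp(\lambda \varepsilon_i/s_i) < \infty$. This
implies the quadratic upper bound for the function $f(\lambda)$ for $|\lambda| \le g_1$; see Lemma B.17.
In words, condition $(e_1)$ requires a light (exponentially decreasing) tail for the marginal
distribution of each $\varepsilon_i$.
Define also

$$ N^{-1/2} \stackrel{\rm def}{=} \max_i \sup_{\gamma \in \mathbb{R}^p} \frac{s_i |\Psi_i^\top \gamma|}{\|V_0 \gamma\|}. \qquad (5.6) $$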

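Lemma 5.5. Assume $(e_1)$ and let $V_0$ be defined by (5.5) and $N$ by (5.6). Then conditions
$(ED_0)$ and $(Er)$ follow from $(e_1)$ with the matrix $V_0$ due to (5.5) and $g = g_1 N^{1/2}$.
Moreover, the stochastic component $\zeta(\theta)$ is linear in $\theta$ and the condition $(ED_1)$ is
fulfilled with $\omega(r) \equiv 0$.

Proof. The gradient of the stochastic component $\zeta(\theta)$ of $L(\theta)$ does not depend on $\theta$,
namely, $\nabla \zeta(\theta) = \sum \Psi_i \varepsilon_i$ with $\varepsilon_i = Y_i - \mathbb{E} Y_i$. Now, for any unit vector $\gamma \in \mathbb{R}^p$ and
$\lambda \le g$, independence of the $\varepsilon_i$'s implies that

$$ \log \mathbb{E} \exp\Bigl\{ \frac{\lambda}{\|V_0 \gamma\|} \gamma^\top \sum \Psi_i \varepsilon_i \Bigr\} = \sum \log \mathbb{E} \exp\Bigl\{ \frac{\lambda s_i \Psi_i^\top \gamma}{\|V_0 \gamma\|} \, \varepsilon_i/s_i \Bigr\}. $$

By definition $s_i |\Psi_i^\top \gamma|/\|V_0 \gamma\| \le N^{-1/2}$ and therefore, $\lambda s_i |\Psi_i^\top \gamma|/\|V_0 \gamma\| \le g_1$. Hence,
(5.4) implies

$$ \log \mathbb{E} \exp\Bigl\{ \frac{\lambda}{\|V_0 \gamma\|} \gamma^\top \sum \Psi_i \varepsilon_i \Bigr\} \le \frac{\nu_0^2 \lambda^2}{2 \|V_0 \gamma\|^2} \sum s_i^2 |\Psi_i^\top \gamma|^2 = \frac{\nu_0^2 \lambda^2}{2}, \qquad (5.7) $$

and $(ED_0)$ follows. □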

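It only remains to bound the quality of quadratic approximation for the mean of the
process $L(\theta,\theta^*)$ in a vicinity of $\theta^*$. An interesting feature of the GLM is that the effect
of model misspecification disappears in the expectation of $L(\theta,\theta^*)$.

Lemma 5.6. It holds

$$ -\mathbb{E} L(\theta,\theta^*) = \sum \bigl\{ d(\Psi_i^\top \theta) - d(\Psi_i^\top \theta^*) - d'(\Psi_i^\top \theta^*) \Psi_i^\top (\theta - \theta^*) \bigr\} = \mathcal{K}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta), \qquad (5.8) $$

where $\mathcal{K}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta)$ is the Kullback-Leibler divergence between the measures $\mathbb{P}_{\theta^*}$ and $\mathbb{P}_\theta$.
Moreover,

$$ -\mathbb{E} L(\theta,\theta^*) = \|D(\theta^\circ)(\theta - \theta^*)\|^2/2, \qquad (5.9) $$

where $\theta^\circ \in [\theta^*, \theta]$ and

$$ D^2(\theta^\circ) = \sum d''(\Psi_i^\top \theta^\circ) \Psi_i \Psi_i^\top. $$

Proof. The definition implies

$$ \mathbb{E} L(\theta,\theta^*) = \sum \bigl\{ f_i \Psi_i^\top (\theta - \theta^*) - d(\Psi_i^\top \theta) + d(\Psi_i^\top \theta^*) \bigr\}. $$

As $\theta^*$ is the extreme point of $\mathbb{E} L(\theta)$, it holds $\nabla \mathbb{E} L(\theta^*) = \sum [f_i - d'(\Psi_i^\top \theta^*)] \Psi_i = 0$ and
(5.8) follows. The Taylor expansion of the second order around $\theta^*$ yields the expansion
(5.9). □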
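
The identity between the Bregman-type expression in (5.8) and the Kullback-Leibler divergence can be verified directly in a single-observation Poisson model. The sketch assumes $d(w) = e^w$ with canonical parameter $w = \log \lambda$ and uses the standard formula $\mathcal{K}(\mathrm{Pois}(\lambda^*), \mathrm{Pois}(\lambda)) = \lambda^* \log(\lambda^*/\lambda) - \lambda^* + \lambda$; the function names are hypothetical.

```python
import math


def bregman_poisson(w_star, w):
    """d(w) - d(w*) - d'(w*) * (w - w*) for the Poisson cumulant d(w) = exp(w)."""
    return math.exp(w) - math.exp(w_star) - math.exp(w_star) * (w - w_star)


def kl_poisson(lam_star, lam):
    """Kullback-Leibler divergence between Poisson(lam_star) and Poisson(lam)."""
    return lam_star * math.log(lam_star / lam) - lam_star + lam


w_star, w = 0.3, 1.1
lhs = bregman_poisson(w_star, w)
rhs = kl_poisson(math.exp(w_star), math.exp(w))
```

Both quantities are nonnegative and vanish only at $w = w^*$, in line with the convexity of $d$.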

Define now the matrix $D_0$ by

$$ D_0^2 \stackrel{\rm def}{=} D^2(\theta^*) = \sum d''(\Psi_i^\top \theta^*) \Psi_i \Psi_i^\top. $$

Let also $V_0$ be defined by (5.5). Note that the matrices $D_0$ and $V_0$ coincide if the model
$Y_i \sim P_{\Psi_i^\top \theta^*}$ is correctly specified and $s_i^2 = d''(\Psi_i^\top \theta^*)$. The matrix $V_0$ describes a local
elliptic neighborhood of the central point $\theta^*$ in the form $\Theta_0(r) = \{\theta : \|V_0(\theta - \theta^*)\| \le
r\}$. If the matrix function $D^2(\theta)$ is continuous in this vicinity $\Theta_0(r)$ then the value
$\delta(r)$ measuring the approximation quality of $-\mathbb{E} L(\theta,\theta^*)$ by the quadratic function
$\|D_0(\theta - \theta^*)\|^2/2$ is small and the identifiability condition $(\mathcal{L}_0)$ is fulfilled on $\Theta_0(r)$.

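Lemma 5.7. Suppose that

$$ \|I_p - D_0^{-1} D^2(\theta) D_0^{-1}\|_\infty \le \delta(r), \qquad \theta \in \Theta_0(r). \qquad (5.10) $$

Then $(\mathcal{L}_0)$ holds with this $\delta(r)$. Moreover, as the quantities $\omega(r), \diamondsuit_\epsilon(r), \diamondsuit_{\underline\epsilon}(r)$ vanish,
one can take $\varrho = 0$ leading to the following representation for $D_\epsilon$ and $\xi_\epsilon$:

$$ D_\epsilon^2 = (1 - \delta) D_0^2, \qquad \xi_\epsilon = (1 + \delta)^{1/2} \xi, $$

$$ D_{\underline\epsilon}^2 = (1 + \delta) D_0^2, \qquad \xi_{\underline\epsilon} = (1 - \delta)^{1/2} \xi, $$

with

$$ \xi \stackrel{\rm def}{=} D_0^{-1} \nabla \zeta(\theta^*). $$

Linearity of the stochastic component $\zeta(\theta)$ in the considered GLM implies the important
fact that the quantities $\diamondsuit_\epsilon(r), \diamondsuit_{\underline\epsilon}(r)$ in the general bracketing bound (3.4) vanish
for any $r$. Therefore, in the GLM case, the deficiency can be defined as the difference
between upper and lower excess and it can be easily evaluated:

$$ \Delta(r) = \|\xi_\epsilon\|^2/2 - \|\xi_{\underline\epsilon}\|^2/2 = \delta \|\xi\|^2. $$

Our result assumes some concentration properties of the squared norm $\|\xi\|^2$ of the
vector $\xi$. These properties can be established by general results of Section A under the
regularity condition: for some $a$

$$ V_0 \le a D_0. \qquad (5.11) $$

Now we are prepared to state the local results for the GLM estimation.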
Now we are prepared to state the local results for the GLM estimation. 
Theorem 5.8. Let $(e_1)$ hold. Then for $\delta \ge \delta(r)$, any $z > 0$ and $\mathfrak{z} > 0$, it holds

ip{\\Do{6 - r) II > z, \\Vo{e - r) || < r) < ]P{ > (i - S)z^] 
ip{L(e,e*)>i, ||yo(^-^*)ll <r) < iP{||^||V2 > (i -5)3}. 

Moreover, on the set $C_\epsilon(r) = \{\|V_0(\tilde\theta - \theta^*)\| \le r, \ \|\xi_{\underline\epsilon}\| \le r\}$, it holds

\\Do(e-e*)-a^<-^Uf. (5.12) 

If the function $d(w)$ is quadratic then the approximation error $\delta$ vanishes as well and
the expansion (5.12) becomes an equality which is also fulfilled globally, so that a localization
step is not required. However, if $d(w)$ is not quadratic, the result applies only locally and
it has to be accomplished with a large deviation bound. The GLM structure is helpful
in the large deviation zone as well. Indeed, the gradient $\nabla \zeta(\theta)$ does not depend on $\theta$
and hence, the most delicate condition $(Er)$ is fulfilled automatically with $g = g_1 N^{1/2}$
for all local sets $\Theta_0(r)$. Further, the identifiability condition $(\mathcal{L}r)$ easily follows from
Lemma 5.6: it suffices to bound from below the matrix $D^2(\theta)$ for $\theta \in \Theta_0(r)$:

$$ D^2(\theta) \ge b(r) V_0^2, \qquad \theta \in \Theta_0(r). $$

An interesting question, similarly to the i.i.d. case, is the minimal radius $r_0$ of the
local vicinity $\Theta_0(r_0)$ ensuring the desirable concentration property. Suppose for the
moment that the constants $b(r)$ are all the same for different $r$: $b(r) \equiv b$. Under the
regularity condition (5.11), a sufficient lower bound for $r_0$ can be based on Corollary 4.3.
The required condition can be restated as

$$ 1 + \sqrt{x + Q} \le 3 \nu_0^2 g/b, \qquad 6 \nu_0 \sqrt{x + Q} \le r b. $$

It remains to note that $Q = c_1 p$ and $g = g_1 N^{1/2}$. So, the required conditions are
fulfilled for $r^2 \ge r_0^2 = C(x + p)$, where $C$ only depends on $\nu_0$, $b$, and $g$.

5.3 Linear median estimation 

This section illustrates how the proposed approach applies to robust estimation in linear models. The target of analysis is the linear dependence of the observed data $Y = (Y_1,\ldots,Y_n)$ on the set of features $\Psi_i \in \mathbb{R}^p$:

$$Y_i = \Psi_i^\top\theta + \varepsilon_i, \qquad (5.13)$$

where $\varepsilon_i$ means the $i$th individual error. As usual, the true data distribution can deviate from the linear model. In addition, we admit contaminated data, which naturally leads to the idea of robust estimation. This section offers a qMLE view on the robust estimation problem. Our parametric family assumes the linear dependence (5.13) with i.i.d. errors $\varepsilon_i$ which follow the double exponential (Laplace) distribution with the density $(1/2)e^{-|y|}$. Then the corresponding log-likelihood reads as

and $\tilde\theta = \operatorname{argmax}_\theta L(\theta)$ is called the least absolute deviation (LAD) estimate. In the context of linear regression, it is also called the linear median estimate. The target of estimation $\theta^*$ is defined as usual by the equation $\theta^* = \operatorname{argmax}_\theta \mathbb{E}L(\theta)$.
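To make the LAD estimate concrete, here is a minimal Python sketch for a scalar parameter ($p = 1$). It assumes only that, up to an additive constant and a positive factor, the Laplace log-likelihood equals $-\sum_i |Y_i - \Psi_i\theta|$, so that maximizing $L$ means minimizing the sum of absolute residuals. The helper names `lad_loss` and `lad_estimate` and the toy data are illustrative and not part of the paper. For scalar $\theta$ a minimizer is the weighted median of the ratios $Y_i/\Psi_i$ with weights $|\Psi_i|$.

```python
def lad_loss(theta, psi, y):
    """Sum of absolute residuals for the scalar model y_i = psi_i * theta + eps_i."""
    return sum(abs(yi - pi * theta) for pi, yi in zip(psi, y))


def lad_estimate(psi, y):
    """Weighted median of y_i / psi_i with weights |psi_i| (a minimiser of lad_loss)."""
    pairs = sorted((yi / pi, abs(pi)) for pi, yi in zip(psi, y) if pi != 0.0)
    half = 0.5 * sum(w for _, w in pairs)
    acc = 0.0
    for ratio, w in pairs:
        acc += w
        if acc >= half:  # first point where the cumulative weight reaches one half
            return ratio
    raise ValueError("all features are zero")


# Toy data (hypothetical): features psi_i and responses y_i.
psi = [1.0, 2.0, -1.0, 0.5, 3.0]
y = [1.2, 3.9, -2.5, 0.7, 6.3]
theta_lad = lad_estimate(psi, y)
```

With all $\Psi_i = 1$ the weighted median reduces to the ordinary sample median, which is why the estimate is also called the linear median estimate.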
It is useful to define the residuals $\tilde\varepsilon_i = Y_i - \Psi_i^\top\theta^*$ and their distributions

$$P_i(A) = \mathbb{P}(\tilde\varepsilon_i \in A) = \mathbb{P}(Y_i - \Psi_i^\top\theta^* \in A)$$

for any Borel set $A$ on the real line. If $Y_i = \Psi_i^\top\theta^* + \varepsilon_i$ is the true model then $P_i$ coincides with the distribution of each $\varepsilon_i$. Below we suppose that each $P_i$ has a positive density $f_i(y)$.

Note that the difference $L(\theta) - L(\theta^*)$ is bounded by $\sum_i |\Psi_i^\top(\theta - \theta^*)|$. Next we

check conditions $(ED_0)$ and $(ED_1)$. Denote $\xi_i(\theta) = \mathbb{1}(Y_i - \Psi_i^\top\theta < 0) - q_i(\theta)$ for $q_i(\theta) = \mathbb{P}(Y_i - \Psi_i^\top\theta < 0)$. This is a centered Bernoulli random variable, and it is easy to check that

$$\nabla\zeta(\theta) = -\sum_{i=1}^n \xi_i(\theta)\Psi_i. \qquad (5.14)$$

This expression differs from the similar ones for the linear and generalized linear regression because the stochastic terms now depend on $\theta$. First we check the global condition $(E\mathrm{r})$. Fix any $g_1 < 1$. Then it holds for a Bernoulli r.v. $Z$ with $\mathbb{P}(Z = 1) = q$, $\zeta = Z - q$, and $|\lambda| \le g_1$

$$\log\mathbb{E}\exp(\lambda\zeta) = \log\bigl[q\exp\{\lambda(1-q)\} + (1-q)\exp(-\lambda q)\bigr] \le \nu_0^2 q(1-q)\lambda^2/2, \qquad (5.15)$$

where $\nu_0 \ge 1$ depends on $g_1$ only. Let now a vector $\gamma \in \mathbb{R}^p$ and $\rho > 0$ be such that $\rho|\Psi_i^\top\gamma| \le g_1$ for $i = 1,\ldots,n$. Then

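$$\log\mathbb{E}\exp\{\rho\gamma^\top\nabla\zeta(\theta)\} \le \frac{\nu_0^2\rho^2}{2}\sum_{i=1}^n q_i(\theta)\{1 - q_i(\theta)\}|\Psi_i^\top\gamma|^2 \le \frac{\nu_0^2\rho^2}{2}\|V(\theta)\gamma\|^2, \qquad (5.16)$$

where

$$V^2(\theta) = \sum_{i=1}^n q_i(\theta)\{1 - q_i(\theta)\}\Psi_i\Psi_i^\top.$$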

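Denote also

$$V_0^2 = \frac{1}{4}\sum_{i=1}^n \Psi_i\Psi_i^\top. \qquad (5.17)$$

Clearly $V(\theta) \le V_0$ for all $\theta$ and condition $(E\mathrm{r})$ is fulfilled with the matrix $V_0$ and $g(r) \equiv g = g_1 N^{1/2}$ for $N$ defined by

$$N^{-1/2} = \max_{i}\sup_{\gamma\in\mathbb{R}^p}\frac{|\Psi_i^\top\gamma|}{2\|V_0\gamma\|}; \qquad (5.18)$$

cf. (5.7).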
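The Bernoulli bound (5.15) is easy to probe numerically. The sketch below scans a grid of $q$ and $\lambda$ and reports the smallest factor $\nu_0^2$ for which (5.15) holds on that grid. The function names and the grid size are illustrative choices, not taken from the paper. As a sanity reference, a second-derivative argument (our own check, not a statement of the source) gives the crude bound $\nu_0^2 \le e^{g_1}$.

```python
import math


def bernoulli_cgf(lam, q):
    """log E exp(lam * (Z - q)) for Z ~ Bernoulli(q)."""
    return math.log(q * math.exp(lam * (1.0 - q)) + (1.0 - q) * math.exp(-lam * q))


def smallest_nu0_sq(g1, grid=200):
    """Grid estimate of the smallest nu0^2 (at least 1) making (5.15) true for |lam| <= g1."""
    worst = 1.0
    for i in range(1, grid):
        q = i / grid
        for j in range(1, grid + 1):
            for lam in (g1 * j / grid, -g1 * j / grid):
                ratio = bernoulli_cgf(lam, q) / (q * (1.0 - q) * lam * lam / 2.0)
                worst = max(worst, ratio)
    return worst


nu0_sq = smallest_nu0_sq(g1=1.0)
```

On this grid the worst case occurs for small $q$ and $\lambda$ close to $g_1$, which illustrates that $\nu_0$ depends on $g_1$ only.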

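Let some $r_0 > 0$ be fixed. We will specify this choice later. Now we check the local conditions within the elliptic vicinity $\Theta_0(r_0) = \{\theta : \|V_0(\theta - \theta^*)\| \le r_0\}$ of the central point $\theta^*$ for $V_0$ from (5.17). Then condition $(ED_0)$ with the matrix $V_0$ and $g = N^{1/2}g_1$ is fulfilled on $\Theta_0(r_0)$ due to (5.16). Next, in view of (5.18), it holds $|\Psi_i^\top\gamma| \le 2N^{-1/2}\|V_0\gamma\|$ for any vector $\gamma \in \mathbb{R}^p$. By (5.14)

$$\nabla\zeta(\theta) - \nabla\zeta(\theta^*) = -\sum_{i=1}^n \Psi_i\{\xi_i(\theta) - \xi_i(\theta^*)\}.$$

If $\Psi_i^\top\theta \ge \Psi_i^\top\theta^*$, then

$$\xi_i(\theta) - \xi_i(\theta^*) = \mathbb{1}(\Psi_i^\top\theta^* \le Y_i < \Psi_i^\top\theta) - \mathbb{P}(\Psi_i^\top\theta^* \le Y_i < \Psi_i^\top\theta).$$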




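Similarly for $\Psi_i^\top\theta < \Psi_i^\top\theta^*$

$$\xi_i(\theta) - \xi_i(\theta^*) = -\mathbb{1}(\Psi_i^\top\theta \le Y_i < \Psi_i^\top\theta^*) + \mathbb{P}(\Psi_i^\top\theta \le Y_i < \Psi_i^\top\theta^*).$$

Define $q_i(\theta,\theta^*) := |q_i(\theta) - q_i(\theta^*)|$. Now (5.15) yields similarly to (5.16)

$$\log\mathbb{E}\exp\bigl\{\rho\gamma^\top\{\nabla\zeta(\theta) - \nabla\zeta(\theta^*)\}\bigr\} \le \frac{\nu_0^2\rho^2}{2}\sum_{i=1}^n q_i(\theta,\theta^*)|\Psi_i^\top\gamma|^2 \le 2\nu_0^2\rho^2\max_{i\le n} q_i(\theta,\theta^*)\,\|V_0\gamma\|^2 \le \omega(r)\nu_0^2\rho^2\|V_0\gamma\|^2/2,$$

with

$$\omega(r) := 4\max_{i\le n}\sup_{\theta\in\Theta_0(r)} q_i(\theta,\theta^*).$$

If each density function $f_i$ is uniformly bounded by a constant $C$ then

$$|q_i(\theta) - q_i(\theta^*)| \le C|\Psi_i^\top(\theta - \theta^*)| \le CN^{-1/2}\|V_0(\theta - \theta^*)\| \le CN^{-1/2}r.$$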

Next we check the local identifiability condition. We use the following technical 
lemma. 

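Lemma 5.9. It holds for any $\theta$

$$-\frac{\partial^2}{\partial\theta^2}\mathbb{E}L(\theta) = D^2(\theta) := \sum_{i=1}^n f_i\bigl(\Psi_i^\top(\theta - \theta^*)\bigr)\Psi_i\Psi_i^\top, \qquad (5.19)$$

where $f_i(\cdot)$ is the density of $\tilde\varepsilon_i = Y_i - \Psi_i^\top\theta^*$. Moreover, there is $\theta^\circ \in [\theta,\theta^*]$ such that

$$-\mathbb{E}L(\theta,\theta^*) = \frac{1}{2}\sum_{i=1}^n |\Psi_i^\top(\theta - \theta^*)|^2 f_i\bigl(\Psi_i^\top(\theta^\circ - \theta^*)\bigr) = (\theta - \theta^*)^\top D^2(\theta^\circ)(\theta - \theta^*)/2. \qquad (5.20)$$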

Proof. Obviously 

The identity (5.19) is obtained by one more differentiation. By definition, $\theta^*$ is the extreme point of $\mathbb{E}L(\theta)$. The equality $\nabla\mathbb{E}L(\theta^*) = 0$ yields

$$\sum_{i=1}^n \{q_i(\theta^*) - 1/2\}\Psi_i = 0.$$

Now (5.20) follows by the Taylor expansion of the second order at $\theta^*$. □

Define

$$D_0^2 := D^2(\theta^*) = \sum_{i=1}^n f_i(0)\Psi_i\Psi_i^\top. \qquad (5.21)$$

Due to this lemma, condition $(\mathcal{L}_0)$ is fulfilled in $\Theta_0(r)$ with this choice $D_0$ for $\delta(r)$
from (5.10); see Lemma 5.7. Moreover, if /i(0) > o^/4 for a > 0, then the identifiability 
condition (X) is also satisfied. Now all the local conditions are fulfilled yielding the 
general bracketing bound of Theorem 3.1 and all its corollaries. 

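It only remains to complement them by a large deviation bound, that is, to specify the local vicinity $\Theta_0(r_0)$ providing the prescribed deviation bound. A sufficient condition for the concentration property is that the expectation $\mathbb{E}L(\theta,\theta^*)$ grows in absolute value with the distance $\|V_0(\theta - \theta^*)\|$. We use the representation (5.19). Suppose that for some fixed $\delta < 1/2$ and $\rho > 0$

$$|f_i(u)/f_i(0) - 1| \le \delta, \qquad |u| \le \rho. \qquad (5.22)$$

For any $\theta$ with $\|V_0(\theta - \theta^*)\| = r \ge r_0$, and for any $i = 1,\ldots,n$, it holds

$$|\Psi_i^\top(\theta - \theta^*)| \le N^{-1/2}\|V_0(\theta - \theta^*)\| = N^{-1/2}r.$$

Therefore, for $r \le \rho N^{1/2}$ and any $\theta \in \Theta_0(r)$ with $\|V_0(\theta - \theta^*)\| = r$, it holds $f_i\bigl(\Psi_i^\top(\theta^\circ - \theta^*)\bigr) \ge (1 - \delta)f_i(0)$. Now Lemma 5.9 and the regularity condition (5.11) imply

$$-\mathbb{E}L(\theta,\theta^*) \ge \frac{1-\delta}{2}\|D_0(\theta - \theta^*)\|^2 \ge \frac{1-\delta}{2a^2}\|V_0(\theta - \theta^*)\|^2 = \frac{(1-\delta)r^2}{2a^2}.$$

By Lemma 5.9 the function $-\mathbb{E}L(\theta,\theta^*)$ is convex. This easily yields

$$-\mathbb{E}L(\theta,\theta^*) \ge \frac{1-\delta}{2a^2}\,\rho N^{1/2}r$$

for all $r \ge \rho N^{1/2}$. Thus,

$$r\,b(r) \ge \begin{cases}(1-\delta)(2a^2)^{-1}r & \text{if } r \le \rho N^{1/2},\\ (1-\delta)(2a^2)^{-1}\rho N^{1/2} & \text{if } r > \rho N^{1/2}.\end{cases}$$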

So, the global identifiability condition (£1) is fulfilled if Tq > Cia^(x + Q) and if 
p^N > C2a^(x + Q) for some fixed constants Ci and C2 . 
Putting all together yields the following result. 

Theorem 5.10. Let $Y_i$ be independent, $\theta^* = \operatorname{argmax}_\theta\mathbb{E}L(\theta)$, $D_0$ be given by (5.21), and $V_0$ by (5.17). Let also the densities $f_i(\cdot)$ of $Y_i - \Psi_i^\top\theta^*$ be uniformly bounded by
a constant C , fulfill (5.22) for some p > and 5 > 0, and fi{0) > 0^/4 for all i. 
Finally, let N > C2p~^a^(x + p) for some fixed x > and C2 ■ Then on the random 
set of probability at least $1 - e^{-x}$, one obtains for $\xi := D_0^{-1}\nabla L(\theta^*)$ the bounds

$$\|D_0(\tilde\theta - \theta^*) - \xi\|^2 = o(p), \qquad 2L(\tilde\theta,\theta^*) - \|\xi\|^2 = o(p).$$

A Deviation probability for quadratic forms

The approximation results of the previous sections rely on the probability of the form $\mathbb{P}(\|\xi\| > y)$ for a given random vector $\xi \in \mathbb{R}^p$. The only condition imposed on this vector is that

$$\log\mathbb{E}\exp(\gamma^\top\xi) \le \nu_0^2\|\gamma\|^2/2, \qquad \gamma\in\mathbb{R}^p,\ \|\gamma\| \le g.$$

To simplify the presentation we rewrite this condition as

$$\log\mathbb{E}\exp(\gamma^\top\xi) \le \|\gamma\|^2/2, \qquad \gamma\in\mathbb{R}^p,\ \|\gamma\| \le g. \qquad (A.1)$$

The general case can be reduced to $\nu_0 = 1$ by rescaling $\xi$ and $g$:

$$\log\mathbb{E}\exp(\gamma^\top\xi/\nu_0) \le \|\gamma\|^2/2, \qquad \gamma\in\mathbb{R}^p,\ \|\gamma\| \le \nu_0 g,$$

that is, $\nu_0^{-1}\xi$ fulfills (A.1) with a slightly increased $g$. In typical situations like in Section 5, the value $g$ is large (of order root-n) while the value $\nu_0$ is close to one.

A.1 Gaussian case

Our benchmark will be a deviation bound for $\|\xi\|^2$ for a standard Gaussian vector $\xi$. The ultimate goal is to show that under (A.1) the norm of the vector $\xi$ exhibits the behavior expected for a Gaussian vector, at least in the region of moderate deviations. For the sake of comparison, we begin by stating the result for a Gaussian vector $\xi$.

Theorem A.1. Let $\xi$ be a standard normal vector in $\mathbb{R}^p$. Then for any $u > 0$, it holds

$$\mathbb{P}(\|\xi\|^2 > p + u) \le \exp\{-(p/2)\phi(u/p)\}$$

with

$$\phi(t) := t - \log(1+t).$$

Let $\phi^{-1}(\cdot)$ stand for the inverse of $\phi(\cdot)$. For any $x$,

$$\mathbb{P}\bigl(\|\xi\|^2 > p + p\,\phi^{-1}(2x/p)\bigr) \le \exp(-x).$$

This particularly yields with $\varkappa = 6.6$

$$\mathbb{P}\bigl(\|\xi\|^2 > p + \sqrt{\varkappa x p}\vee(\varkappa x)\bigr) \le \exp(-x).$$

Proof. The proof utilizes the following well known fact: for $\mu < 1$

$$\log\mathbb{E}\exp(\mu\|\xi\|^2/2) = -0.5\,p\log(1-\mu).$$

It can be obtained by straightforward calculus. Now consider any $u > 0$. By the exponential Chebyshev inequality

$$\mathbb{P}(\|\xi\|^2 > p + u) \le \exp\{-\mu(p+u)/2\}\,\mathbb{E}\exp(\mu\|\xi\|^2/2) = \exp\{-\mu(p+u)/2 - (p/2)\log(1-\mu)\}. \qquad (A.2)$$

It is easy to see that the value $\mu = u/(u+p)$ maximizes $\mu(p+u) + p\log(1-\mu)$ w.r.t. $\mu$ yielding

$$\mu(p+u) + p\log(1-\mu) = u - p\log(1 + u/p).$$

Further we use that $t - \log(1+t) \ge a_0 t^2$ for $t \le 1$ and $t - \log(1+t) \ge a_0 t$ for $t > 1$ with $a_0 = 1 - \log(2) \ge 0.3$. This implies with $t = u/p$ for $u = \sqrt{\varkappa x p}$ or $u = \varkappa x$ and $\varkappa = 2/a_0 < 6.6$ that

$$\mathbb{P}\bigl(\|\xi\|^2 > p + \sqrt{\varkappa x p}\vee(\varkappa x)\bigr) \le \exp(-x)$$

as required. □

The message of this result is that the squared norm of the Gaussian vector $\xi$ concentrates around the value $p$ and the deviations over the level $p + \sqrt{xp}$ are exponentially small in $x$.
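The bound of Theorem A.1 can be compared with exact chi-squared tails. The sketch below is only an illustration: it uses the closed form $\mathbb{P}(\chi^2_p > t) = e^{-t/2}\sum_{k<p/2}(t/2)^k/k!$, which is valid for even $p$, and checks that the tail at the level $p + \sqrt{\varkappa x p}\vee(\varkappa x)$ does not exceed $e^{-x}$. The function names are ours.

```python
import math

KAPPA = 6.6  # the constant from Theorem A.1


def chi2_tail_even(p, t):
    """Exact P(chi^2_p > t) for even p via the Poisson-sum identity."""
    if p % 2 != 0:
        raise ValueError("the closed form used here needs even p")
    half = t / 2.0
    term, total = 1.0, 0.0
    for k in range(p // 2):
        total += term            # term equals half**k / k!
        term *= half / (k + 1)
    return math.exp(-half) * total


def level(p, x):
    """Threshold p + sqrt(kappa*x*p) v (kappa*x) from Theorem A.1."""
    return p + max(math.sqrt(KAPPA * x * p), KAPPA * x)
```

For instance, `chi2_tail_even(10, level(10, 1.0))` can be compared with `math.exp(-1.0)` to see how much room the bound leaves in a given regime.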

A similar bound can be obtained for a norm of the vector $B\xi$ where $B$ is some given matrix. For notational simplicity we assume that $B$ is symmetric. Otherwise one should replace it with $(B^\top B)^{1/2}$.

Theorem A.2. Let $\xi$ be standard normal in $\mathbb{R}^p$. Then for every $x > 0$ and any symmetric matrix $B$, it holds with $p_0 = \operatorname{tr}(B^2)$, $v^2 = 2\operatorname{tr}(B^4)$, and $a^* = \|B^2\|_\infty$

$$\mathbb{P}\bigl(\|B\xi\|^2 > p_0 + (2vx^{1/2})\vee(6a^*x)\bigr) \le \exp(-x).$$

Proof. The matrix $B^2$ can be represented as $U^\top\operatorname{diag}(a_1,\ldots,a_p)U$ for an orthogonal matrix $U$. The vector $\tilde\xi = U\xi$ is also standard normal and $\|B\xi\|^2 = \tilde\xi^\top UB^2U^\top\tilde\xi$. This means that one can reduce the situation to the case of a diagonal matrix $B^2 = \operatorname{diag}(a_1,\ldots,a_p)$. We can also assume without loss of generality that $a_1 \ge a_2 \ge \ldots \ge a_p$. The expressions for the quantities $p_0$ and $v^2$ simplify to

$$p_0 = \operatorname{tr}(B^2) = a_1 + \ldots + a_p, \qquad v^2 = 2\operatorname{tr}(B^4) = 2(a_1^2 + \ldots + a_p^2).$$

Moreover, rescaling the matrix $B^2$ by $a_1$ reduces the situation to the case with $a_1 = 1$.

Lemma A.3. It holds

$$\mathbb{E}\|B\xi\|^2 = \operatorname{tr}(B^2), \qquad \operatorname{Var}(\|B\xi\|^2) = 2\operatorname{tr}(B^4).$$

Moreover, for $\mu < 1$

$$\mathbb{E}\exp\{\mu\|B\xi\|^2/2\} = \det(I_p - \mu B^2)^{-1/2} = \prod_{i=1}^p (1 - \mu a_i)^{-1/2}. \qquad (A.3)$$

Proof. If $B^2$ is diagonal, then $\|B\xi\|^2 = \sum_i a_i\xi_i^2$ and the summands $a_i\xi_i^2$ are independent. It remains to note that $\mathbb{E}(a_i\xi_i^2) = a_i$, $\operatorname{Var}(a_i\xi_i^2) = 2a_i^2$, and for $\mu a_i < 1$,

$$\mathbb{E}\exp\{\mu a_i\xi_i^2/2\} = (1 - \mu a_i)^{-1/2},$$

yielding (A.3). □
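Given $u$, fix $\mu < 1$. The exponential Markov inequality yields

$$\mathbb{P}(\|B\xi\|^2 > p_0 + u) \le \exp\Bigl\{-\frac{\mu(p_0+u)}{2}\Bigr\}\mathbb{E}\exp\Bigl(\frac{\mu\|B\xi\|^2}{2}\Bigr) = \exp\Bigl\{-\frac{\mu u}{2} - \frac{1}{2}\sum_{i=1}^p\bigl[\mu a_i + \log(1 - \mu a_i)\bigr]\Bigr\}.$$

We start with the case when $x^{1/2} \le v/3$. Then $u = 2x^{1/2}v$ fulfills $u \le 2v^2/3$. Define $\mu = u/v^2 \le 2/3$ and use that $t + \log(1-t) \ge -t^2$ for $t \le 2/3$. This implies

$$\mathbb{P}(\|B\xi\|^2 > p_0 + u) \le \exp\Bigl\{-\frac{\mu u}{2} + \frac{1}{2}\sum_{i=1}^p \mu^2a_i^2\Bigr\} = \exp\bigl(-u^2/(4v^2)\bigr) = e^{-x}. \qquad (A.4)$$

Next, let $x^{1/2} > v/3$. Set $\mu = 2/3$. It holds similarly to the above

$$\sum_{i=1}^p\bigl[\mu a_i + \log(1 - \mu a_i)\bigr] \ge -\sum_{i=1}^p \mu^2a_i^2 \ge -2v^2/9 \ge -2x.$$

Now, for $u = 6x$ and $\mu u/2 = 2x$, (A.4) implies

$$\mathbb{P}(\|B\xi\|^2 > p_0 + u) \le \exp\{-(2x - x)\} = \exp(-x)$$

as required. □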
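The two cases in the proof of Theorem A.2 can be replayed numerically. The following sketch evaluates the exact exponent of the Markov bound, $-\mu u/2 - \tfrac12\sum_i[\mu a_i + \log(1-\mu a_i)]$, with $(u,\mu)$ chosen as in the proof, and lets one confirm that it is at most $-x$. It assumes the normalization $a_1 = \max_i a_i = 1$ used in the proof. The eigenvalues in the example are arbitrary illustrative numbers, and the function names are ours.

```python
import math


def chernoff_exponent(a, u, mu):
    """Exponent -mu*u/2 - 0.5*sum[mu*a_i + log(1 - mu*a_i)] of the Markov bound."""
    return -0.5 * mu * u - 0.5 * sum(mu * ai + math.log(1.0 - mu * ai) for ai in a)


def a2_choice(a, x):
    """Pick (u, mu) as in the proof of Theorem A.2, assuming max(a) == 1."""
    v = math.sqrt(2.0 * sum(ai * ai for ai in a))
    if math.sqrt(x) <= v / 3.0:
        u = 2.0 * math.sqrt(x) * v   # sub-Gaussian zone
        mu = u / (v * v)
    else:
        u, mu = 6.0 * x, 2.0 / 3.0   # sub-exponential zone
    return u, mu


eigs = [1.0, 0.8, 0.5, 0.5, 0.1]      # hypothetical eigenvalues of B^2
u, mu = a2_choice(eigs, 1.0)
exponent = chernoff_exponent(eigs, u, mu)
```

Here `exponent` is the logarithm of the resulting probability bound for $x = 1$, so it should not exceed $-1$.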
Below we establish similar bounds for a non-Gaussian vector $\xi$ obeying (A.1).

A.2 A bound for the $\ell_2$-norm

This section presents a general exponential bound for the probability $\mathbb{P}(\|\xi\| > y)$ under (A.1). Given $g$ and $p$, define the values $w_0 = gp^{-1/2}$ and $w_c$ by the equation

$$\frac{w_c(1 + w_c)}{(1 + w_c^2)^{1/2}} = w_0 = gp^{-1/2}. \qquad (A.5)$$

It is easy to see that $w_0/\sqrt{2} \le w_c \le w_0$. Further define

$$\mu_c := \frac{w_c^2}{1 + w_c^2}, \qquad y_c^2 := (1 + w_c^2)p, \qquad x_c := 0.5\,p\bigl[w_c^2 - \log(1 + w_c^2)\bigr]. \qquad (A.6)$$

Note that for $g^2 \ge p$, the quantities $y_c$ and $x_c$ can be evaluated as $y_c^2 \ge w_c^2p \ge g^2/2$ and $x_c \gtrsim pw_c^2/2 \ge g^2/4$.
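Equation (A.5) has no closed-form solution, but the bracket $w_0/\sqrt{2} \le w_c \le w_0$ makes bisection straightforward. The sketch below computes $w_c$ and the derived critical values. It assumes $\mu_c = w_c^2/(1+w_c^2)$, $y_c^2 = (1+w_c^2)p$ and $g_c = g - \sqrt{\mu_c p}$ as used in the proof of Theorem A.4. The function name and the sample values of $g$ and $p$ are illustrative.

```python
import math


def critical_values(g, p):
    """Solve (A.5) for w_c by bisection; return (w_c, mu_c, y_c, x_c, g_c)."""
    w0 = g / math.sqrt(p)

    def f(w):
        return w * (1.0 + w) / math.sqrt(1.0 + w * w)

    # f is increasing and w <= f(w) <= sqrt(2)*w, so the root lies in this bracket.
    lo, hi = w0 / math.sqrt(2.0), w0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < w0:
            lo = mid
        else:
            hi = mid
    wc = 0.5 * (lo + hi)
    mu_c = wc * wc / (1.0 + wc * wc)
    y_c = math.sqrt((1.0 + wc * wc) * p)
    x_c = 0.5 * p * (wc * wc - math.log(1.0 + wc * wc))
    g_c = g - math.sqrt(mu_c * p)
    return wc, mu_c, y_c, x_c, g_c


wc, mu_c, y_c, x_c, g_c = critical_values(g=30.0, p=10.0)
```

The returned values also satisfy the identities $y_c = g/\mu_c - \sqrt{p/\mu_c}$ and $g_c = gw_c/(1+w_c)$, which gives a quick consistency check of the definitions.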

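Theorem A.4. Let $\xi \in \mathbb{R}^p$ fulfill (A.1). Then it holds for each $x \le x_c$

$$\mathbb{P}\bigl(\|\xi\|^2 > p + \sqrt{\varkappa xp}\vee(\varkappa x),\ \|\xi\| \le y_c\bigr) \le 2\exp(-x),$$

where $\varkappa = 6.6$. Moreover, for $y \ge y_c$, it holds with $g_c = g - \sqrt{\mu_cp} = gw_c/(1 + w_c)$

$$\mathbb{P}(\|\xi\| > y) \le 8.4\exp\{-g_cy/2 - (p/2)\log(1 - g_c/y)\} \le 8.4\exp\{-x_c - g_c(y - y_c)/2\}.$$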

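Proof. The main step of the proof is the following exponential bound.

Lemma A.5. Suppose (A.1). For any $\mu < 1$ with $g^2 > p\mu$, it holds

$$\mathbb{E}\exp\Bigl(\frac{\mu\|\xi\|^2}{2}\Bigr)\mathbb{1}\bigl(\|\xi\| \le g/\mu - \sqrt{p/\mu}\bigr) \le 2(1 - \mu)^{-p/2}. \qquad (A.7)$$

Proof. Let $\varepsilon$ be a standard normal vector in $\mathbb{R}^p$ and $u \in \mathbb{R}^p$. The bound $\mathbb{P}(\|\varepsilon\|^2 > p) \le 1/2$ implies for any vector $u$ and any $r$ with $r \ge \|u\| + p^{1/2}$ that $\mathbb{P}(\|u + \varepsilon\| \le r) \ge 1/2$. Let us fix some $\xi$ with $\|\xi\| \le g/\mu - \sqrt{p/\mu}$ and denote by $\mathbb{P}_\xi$ the conditional probability given $\xi$. It holds with $c_p = (2\pi)^{-p/2}$

$$c_p\int\exp\Bigl(\gamma^\top\xi - \frac{\|\gamma\|^2}{2\mu}\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma = c_p\exp(\mu\|\xi\|^2/2)\int\exp\Bigl(-\frac{1}{2}\bigl\|\mu^{-1/2}\gamma - \mu^{1/2}\xi\bigr\|^2\Bigr)\mathbb{1}\bigl(\mu^{-1/2}\|\gamma\| \le \mu^{-1/2}g\bigr)\,d\gamma$$

$$= \mu^{p/2}\exp(\mu\|\xi\|^2/2)\,\mathbb{P}_\xi\bigl(\|\varepsilon + \mu^{1/2}\xi\| \le \mu^{-1/2}g\bigr) \ge 0.5\,\mu^{p/2}\exp(\mu\|\xi\|^2/2),$$

because $\|\mu^{1/2}\xi\| + p^{1/2} \le \mu^{-1/2}g$. This implies in view of $p < g^2/\mu$ that

$$\exp(\mu\|\xi\|^2/2)\,\mathbb{1}\bigl(\|\xi\| \le g/\mu - \sqrt{p/\mu}\bigr) \le 2\mu^{-p/2}c_p\int\exp\Bigl(\gamma^\top\xi - \frac{\|\gamma\|^2}{2\mu}\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma.$$

Further, by (A.1)

$$c_p\,\mathbb{E}\int\exp\Bigl(\gamma^\top\xi - \frac{\|\gamma\|^2}{2\mu}\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma \le c_p\int\exp\Bigl(-\frac{\mu^{-1} - 1}{2}\|\gamma\|^2\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma \le c_p\int\exp\Bigl(-\frac{\mu^{-1} - 1}{2}\|\gamma\|^2\Bigr)d\gamma \le (\mu^{-1} - 1)^{-p/2}$$

and (A.7) follows. □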

Due to this result, the scaled squared norm $\mu\|\xi\|^2/2$ after a proper truncation possesses the same exponential moments as in the Gaussian case. A straightforward implication is the probability bound $\mathbb{P}(\|\xi\|^2 > p + u)$ for moderate values $u$. Namely, given $u > 0$, define $\mu = u/(u + p)$. This value optimizes the inequality (A.2) in the Gaussian case. Now we can apply a similar bound under the constraint $\|\xi\| \le g/\mu - \sqrt{p/\mu}$. Therefore, the bound is only meaningful if $\sqrt{u + p} \le g/\mu - \sqrt{p/\mu}$ with $\mu = u/(u + p)$, or, with $w = \sqrt{u/p}$, if $w \le w_c$; see (A.5).

The largest value $u$ for which this constraint is still valid is given by $p + u = y_c^2$. Hence, (A.7) yields for $p + u \le y_c^2$

$$\mathbb{P}\bigl(\|\xi\|^2 > p + u,\ \|\xi\| \le y_c\bigr) \le \exp\Bigl\{-\frac{\mu(p+u)}{2}\Bigr\}\mathbb{E}\exp\Bigl(\frac{\mu\|\xi\|^2}{2}\Bigr)\mathbb{1}\bigl(\|\xi\| \le g/\mu - \sqrt{p/\mu}\bigr)$$

$$\le 2\exp\bigl\{-0.5\bigl[\mu(p + u) + p\log(1 - \mu)\bigr]\bigr\} = 2\exp\bigl\{-0.5\bigl[u - p\log(1 + u/p)\bigr]\bigr\}.$$

Similarly to the Gaussian case, this implies with $\varkappa = 6.6$ that

$$\mathbb{P}\bigl(\|\xi\|^2 \ge p + \sqrt{\varkappa xp}\vee(\varkappa x),\ \|\xi\| \le y_c\bigr) \le 2\exp(-x).$$

The Gaussian case means that (A.1) holds with $g = \infty$ yielding $y_c = \infty$. In the non-Gaussian case with a finite $g$, we have to accompany the moderate deviation bound with a large deviation bound $\mathbb{P}(\|\xi\| > y)$ for $y \ge y_c$. This is done by combining the bound (A.7) with the standard slicing arguments.

Lemma A.6. Let $\mu_0 \le g^2/p$. Define $y_0 = g/\mu_0 - \sqrt{p/\mu_0}$ and $g_0 = \mu_0y_0 = g - \sqrt{\mu_0p}$. It holds for $y \ge y_0$

$$\mathbb{P}(\|\xi\| > y) \le 8.4\,(1 - g_0/y)^{-p/2}\exp(-g_0y/2) \qquad (A.8)$$

$$\le 8.4\exp\{-x_0 - g_0(y - y_0)/2\}, \qquad (A.9)$$

with $x_0$ defined by

$$2x_0 = \mu_0y_0^2 + p\log(1 - \mu_0) = g^2/\mu_0 - p + p\log(1 - \mu_0).$$

Proof. Consider the growing sequence $y_k$ with $y_1 = y$ and $g_0y_{k+1} = g_0y + k$. Define also $\mu_k = g_0/y_k$. In particular, $\mu_k \le \mu_1 = g_0/y$. Obviously

$$\mathbb{P}(\|\xi\| > y) = \sum_{k=1}^\infty\mathbb{P}\bigl(\|\xi\| > y_k,\ \|\xi\| \le y_{k+1}\bigr).$$

Now we try to evaluate every slicing probability in this expression. We use that

$$\mu_{k+1}y_k^2 = \frac{(g_0y + k - 1)^2}{g_0y + k} \ge g_0y + k - 2,$$

and also $g/\mu_k - \sqrt{p/\mu_k} \ge y_k$ because $g - g_0 = \sqrt{\mu_0p} \ge \sqrt{\mu_kp}$ and

$$g/\mu_k - \sqrt{p/\mu_k} - y_k = \mu_k^{-1}\bigl(g - \sqrt{\mu_kp} - g_0\bigr) \ge 0.$$

Hence by (A.7)

$$\mathbb{P}(\|\xi\| > y) \le \sum_{k=1}^\infty\mathbb{P}\bigl(\|\xi\| > y_k,\ \|\xi\| \le y_{k+1}\bigr) \le \sum_{k=1}^\infty\exp\Bigl(-\frac{\mu_{k+1}y_k^2}{2}\Bigr)\mathbb{E}\exp\Bigl(\frac{\mu_{k+1}\|\xi\|^2}{2}\Bigr)\mathbb{1}\bigl(\|\xi\| \le y_{k+1}\bigr)$$

$$\le 2\sum_{k=1}^\infty(1 - \mu_{k+1})^{-p/2}\exp\Bigl(-\frac{\mu_{k+1}y_k^2}{2}\Bigr) \le 2(1 - \mu_1)^{-p/2}\sum_{k=1}^\infty\exp\Bigl(-\frac{g_0y + k - 2}{2}\Bigr)$$

$$= 2e^{1/2}(1 - e^{-1/2})^{-1}(1 - \mu_1)^{-p/2}\exp(-g_0y/2) \le 8.4\,(1 - \mu_1)^{-p/2}\exp(-g_0y/2)$$

and the first assertion follows. For $y = y_0$, it holds

$$g_0y_0 + p\log(1 - \mu_0) = \mu_0y_0^2 + p\log(1 - \mu_0) = 2x_0$$

and (A.8) implies $\mathbb{P}(\|\xi\| > y_0) \le 8.4\exp(-x_0)$. Now observe that the function $f(y) = g_0y/2 + (p/2)\log(1 - g_0/y)$ fulfills $f(y_0) = x_0$ and $f'(y) \ge g_0/2$ yielding $f(y) \ge x_0 + g_0(y - y_0)/2$. This implies (A.9). □
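The numerical constant 8.4 in (A.8) comes from summing the geometric series left after the slicing step. The short sketch below, with function names of our own choosing, reproduces this arithmetic and also checks the elementary inequality $(a-1)^2/a \ge a - 2$ used for $\mu_{k+1}y_k^2$ with $a = g_0y + k$.

```python
import math


def slicing_constant(terms=400):
    """2 * sum_{k=1}^{terms} exp(-(k - 2)/2): the factor in front of
    (1 - mu_1)^(-p/2) * exp(-g0*y/2) after the slicing sum."""
    return 2.0 * sum(math.exp(-(k - 2) / 2.0) for k in range(1, terms + 1))


def slice_gap(a):
    """(a - 1)^2 / a - (a - 2) with a = g0*y + k; equals 1/a, hence positive."""
    return (a - 1.0) ** 2 / a - (a - 2.0)


closed_form = 2.0 * math.exp(0.5) / (1.0 - math.exp(-0.5))
```

The closed form evaluates to slightly less than 8.4, which is why the constant can be rounded up in the statement.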

The statements of the theorem are obtained by applying the lemmas with $\mu_0 = \mu_c = w_c^2/(1 + w_c^2)$. This also implies $y_0 = y_c$, $x_0 = x_c$, and $g_0 = g_c = g - \sqrt{\mu_cp}$; cf. (A.6). □

The statements of Theorem A.4 can be simplified under the assumption $g^2 \ge p$.

Corollary A.7. Let $\xi$ fulfill (A.1) and $g^2 \ge p$. Then it holds for $x \le x_c$

$$\mathbb{P}\bigl(\|\xi\|^2 \ge \mathfrak{z}(x,p)\bigr) \le 2e^{-x} + 8.4e^{-x_c}, \qquad (A.10)$$

$$\mathfrak{z}(x,p) := \begin{cases}p + \sqrt{\varkappa xp}, & x \le p/\varkappa,\\ p + \varkappa x, & p/\varkappa < x \le x_c,\end{cases} \qquad (A.11)$$

with $\varkappa = 6.6$. For $x > x_c$

$$\mathbb{P}\bigl(\|\xi\|^2 \ge \mathfrak{z}_c(x,p)\bigr) \le 8.4e^{-x}, \qquad \mathfrak{z}_c(x,p) := \bigl|y_c + 2(x - x_c)/g_c\bigr|^2.$$

This result implicitly assumes that $p \le \varkappa x_c$, which is fulfilled if $w_0^2 = g^2/p \ge 1$:

$$\varkappa x_c = 0.5\varkappa\bigl[w_c^2 - \log(1 + w_c^2)\bigr]p \ge 3.3\bigl[1 - \log(2)\bigr]p > p.$$

In the zone $x \le p/\varkappa$ we obtain sub-Gaussian behavior of the tail of $\|\xi\|^2 - p$; in the zone $p/\varkappa < x \le x_c$ it becomes sub-exponential. Note that the sub-exponential zone is empty if $g^2 < p$.

For $x \le x_c$, the function $\mathfrak{z}(x,p)$ mimics the quantile behavior of the chi-squared distribution $\chi^2_p$ with $p$ degrees of freedom. Moreover, an increase of the value $g$ yields a growth of the sub-Gaussian zone. In particular, for $g = \infty$, a general quadratic form $\|\xi\|^2$ has under (A.1) the same tail behavior as in the Gaussian case.

Finally, in the large deviation zone $x > x_c$ the deviation probability decays as $e^{-cx^{1/2}}$ for some fixed $c$. However, if the constant $g$ in the condition (A.1) is sufficiently large relative to $p$, then $x_c$ is large as well and the large deviation zone $x > x_c$ can be ignored at a small price of $8.4e^{-x_c}$, and one can focus on the deviation bound described by (A.10) and (A.11).
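The piecewise level (A.11) and its large deviation counterpart are simple to code. The sketch below is illustrative only: `z_moderate` implements (A.11) for $x \le x_c$, and `z_large` implements $|y_c + 2(x - x_c)/g_c|^2$ with the critical values $x_c$, $y_c$, $g_c$ passed in as arguments (they would come from solving (A.5), which is not repeated here). The function names are ours.

```python
import math

KAPPA = 6.6


def z_moderate(x, p):
    """(A.11): p + sqrt(kappa*x*p) for x <= p/kappa, else p + kappa*x."""
    if x <= p / KAPPA:
        return p + math.sqrt(KAPPA * x * p)   # sub-Gaussian zone
    return p + KAPPA * x                      # sub-exponential zone


def z_large(x, x_c, y_c, g_c):
    """Large deviation level |y_c + 2(x - x_c)/g_c|^2 for x > x_c."""
    return (y_c + 2.0 * (x - x_c) / g_c) ** 2
```

The two branches of `z_moderate` agree at $x = p/\varkappa$, where both equal $2p$, and the function coincides with $p + \sqrt{\varkappa xp}\vee(\varkappa x)$ from Theorem A.4.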

A.3 A bound for a quadratic form

Now we extend the result to a more general bound for $\|B\xi\|^2 = \xi^\top B^2\xi$ with a given matrix $B$ and a vector $\xi$ obeying the condition (A.1). Similarly to the Gaussian case we assume that $B$ is symmetric. Define important characteristics of $B$

$$p_0 = \operatorname{tr}(B^2), \qquad v^2 = 2\operatorname{tr}(B^4), \qquad \lambda^* := \|B^2\|_\infty := \lambda_{\max}(B^2).$$

For simplicity of formulation we suppose that $\lambda^* = 1$; otherwise one has to replace $p_0$ and $v^2$ with $p_0/\lambda^*$ and $v^2/\lambda^*$.

Let $g$ be the constant from (A.1). Define similarly to the $\ell_2$-case $w_c$ by the equation

$$\frac{w_c(1 + w_c)}{(1 + w_c^2)^{1/2}} = gp_0^{-1/2}.$$

Define also $\mu_c = w_c^2/(1 + w_c^2)\wedge 2/3$. Note that $w_c^2 \ge 2$ implies $\mu_c = 2/3$. Further define

$$y_c^2 = (1 + w_c^2)p_0, \qquad 2x_c = \mu_cy_c^2 + \log\det\{I_p - \mu_cB^2\}. \qquad (A.12)$$

Similarly to the case with $B = I_p$, under the condition $g^2 \ge p_0$, one can bound $y_c^2 \ge g^2/2$ and $x_c \gtrsim g^2/4$.

Theorem A.8. Let a random vector $\xi$ in $\mathbb{R}^p$ fulfill (A.1). Then for each $x \le x_c$

$$\mathbb{P}\bigl(\|B\xi\|^2 \ge p_0 + (2vx^{1/2})\vee(6x),\ \|B\xi\| \le y_c\bigr) \le 2\exp(-x).$$

Moreover, for $y \ge y_c$, with $g_c = g - \sqrt{\mu_cp_0} = gw_c/(1 + w_c)$, it holds

$$\mathbb{P}(\|B\xi\| > y) \le 8.4\exp\bigl(-x_c - g_c(y - y_c)/2\bigr).$$

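Proof. The main steps of the proof are similar to the proof of Theorem A.4.

Lemma A.9. Suppose (A.1). For any $\mu < 1$ with $g^2/\mu \ge p_0$, it holds

$$\mathbb{E}\exp\bigl(\mu\|B\xi\|^2/2\bigr)\mathbb{1}\bigl(\|B^2\xi\| \le g/\mu - \sqrt{p_0/\mu}\bigr) \le 2\det(I_p - \mu B^2)^{-1/2}. \qquad (A.13)$$

Proof. With $c_p(B) = (2\pi)^{-p/2}\det(B^{-1})$

$$c_p(B)\int\exp\Bigl(\gamma^\top\xi - \frac{1}{2\mu}\|B^{-1}\gamma\|^2\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma = c_p(B)\exp\Bigl(\frac{\mu\|B\xi\|^2}{2}\Bigr)\int\exp\Bigl(-\frac{1}{2}\bigl\|\mu^{1/2}B\xi - \mu^{-1/2}B^{-1}\gamma\bigr\|^2\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma$$

$$= \mu^{p/2}\exp\Bigl(\frac{\mu\|B\xi\|^2}{2}\Bigr)\mathbb{P}_\xi\bigl(\|\mu^{-1/2}B\varepsilon + B^2\xi\| \le g/\mu\bigr),$$

where $\varepsilon$ denotes a standard normal vector in $\mathbb{R}^p$ and $\mathbb{P}_\xi$ means the conditional probability given $\xi$. Moreover, for any $u \in \mathbb{R}^p$ and $r \ge p_0^{1/2} + \|u\|$, it holds in view of $\mathbb{P}(\|B\varepsilon\|^2 > p_0) \le 1/2$

$$\mathbb{P}(\|B\varepsilon - u\| \le r) \ge \mathbb{P}\bigl(\|B\varepsilon\| \le \sqrt{p_0}\bigr) \ge 1/2.$$

This implies

$$\exp\bigl(\mu\|B\xi\|^2/2\bigr)\mathbb{1}\bigl(\|B^2\xi\| \le g/\mu - \sqrt{p_0/\mu}\bigr) \le 2\mu^{-p/2}c_p(B)\int\exp\Bigl(\gamma^\top\xi - \frac{1}{2\mu}\|B^{-1}\gamma\|^2\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma.$$

Further, by (A.1)

$$c_p(B)\,\mathbb{E}\int\exp\Bigl(\gamma^\top\xi - \frac{1}{2\mu}\|B^{-1}\gamma\|^2\Bigr)\mathbb{1}(\|\gamma\| \le g)\,d\gamma \le c_p(B)\int\exp\Bigl(\frac{\|\gamma\|^2}{2} - \frac{1}{2\mu}\|B^{-1}\gamma\|^2\Bigr)d\gamma = \mu^{p/2}\det(I_p - \mu B^2)^{-1/2}$$

and (A.13) follows. □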
Now we evaluate the probability $\mathbb{P}(\|B\xi\| > y)$ for moderate values of $y$.

Lemma A.10. Let $\mu_0 < 1\wedge(g^2/p_0)$. With $y_0 = g/\mu_0 - \sqrt{p_0/\mu_0}$, it holds for any $u > 0$

$$\mathbb{P}\bigl(\|B\xi\|^2 > p_0 + u,\ \|B^2\xi\| \le y_0\bigr) \le 2\exp\bigl\{-0.5\mu_0(p_0 + u) - 0.5\log\det(I_p - \mu_0B^2)\bigr\}. \qquad (A.14)$$

In particular, if $B^2$ is diagonal, that is, $B^2 = \operatorname{diag}(a_1,\ldots,a_p)$, then

$$\mathbb{P}\bigl(\|B\xi\|^2 > p_0 + u,\ \|B^2\xi\| \le y_0\bigr) \le 2\exp\Bigl\{-\frac{\mu_0u}{2} - \frac{1}{2}\sum_{i=1}^p\bigl[\mu_0a_i + \log(1 - \mu_0a_i)\bigr]\Bigr\}. \qquad (A.15)$$

Proof. The exponential Chebyshev inequality and (A.13) imply

$$\mathbb{P}\bigl(\|B\xi\|^2 > p_0 + u,\ \|B^2\xi\| \le y_0\bigr) \le \exp\Bigl\{-\frac{\mu_0(p_0 + u)}{2}\Bigr\}\mathbb{E}\exp\Bigl(\frac{\mu_0\|B\xi\|^2}{2}\Bigr)\mathbb{1}\bigl(\|B^2\xi\| \le g/\mu_0 - \sqrt{p_0/\mu_0}\bigr)$$

$$\le 2\exp\bigl\{-0.5\mu_0(p_0 + u) - 0.5\log\det(I_p - \mu_0B^2)\bigr\}.$$

Moreover, the standard change-of-basis arguments allow us to reduce the problem to the case of a diagonal matrix $B^2 = \operatorname{diag}(a_1,\ldots,a_p)$ where $1 = a_1 \ge a_2 \ge \ldots \ge a_p > 0$. Note that $p_0 = a_1 + \ldots + a_p$. Then the claim (A.14) can be written in the form (A.15). □

Now we evaluate a large deviation probability that $\|B\xi\| > y$ for a large $y$. Note that the condition $\|B^2\|_\infty \le 1$ implies $\|B^2\xi\| \le \|B\xi\|$. So, the bound (A.14) continues to hold when $\|B^2\xi\| \le y_0$ is replaced by $\|B\xi\| \le y_0$.

Lemma A.11. Let $\mu_0 < 1$ and $\mu_0p_0 < g^2$. Define $g_0$ by $g_0 = g - \sqrt{\mu_0p_0}$. For any $y \ge y_0 := g_0/\mu_0$, it holds

$$\mathbb{P}(\|B\xi\| > y) \le 8.4\det\{I_p - (g_0/y)B^2\}^{-1/2}\exp(-g_0y/2) \le 8.4\exp\bigl(-x_0 - g_0(y - y_0)/2\bigr), \qquad (A.16)$$

where $x_0$ is defined by

$$2x_0 = g_0y_0 + \log\det\{I_p - (g_0/y_0)B^2\}.$$

Proof. The slicing arguments of Lemma A.6 apply here in the same manner. One has to replace $\|\xi\|$ by $\|B\xi\|$ and $(1 - \mu_1)^{-p/2}$ by $\det\{I_p - (g_0/y)B^2\}^{-1/2}$. We omit the details. In particular, with $y = y_0 = g_0/\mu_0$, this yields

$$\mathbb{P}(\|B\xi\| > y_0) \le 8.4\exp(-x_0).$$

Moreover, for the function $f(y) = g_0y + \log\det\{I_p - (g_0/y)B^2\}$, it holds $f'(y) \ge g_0$ and hence $f(y) \ge f(y_0) + g_0(y - y_0)$ for $y > y_0$. This implies (A.16). □

One important feature of the results of Lemma A.10 and Lemma A.11 is that the value $\mu_0 < 1\wedge(g^2/p_0)$ can be selected arbitrarily. In particular, for $y \ge y_c$, Lemma A.11 with $\mu_0 = \mu_c$ yields the large deviation probability $\mathbb{P}(\|B\xi\| > y)$. For bounding the probability $\mathbb{P}(\|B\xi\|^2 > p_0 + u,\ \|B\xi\| \le y_c)$, we use the inequality $\log(1 - t) \ge -t - t^2$ for $t \le 2/3$. By (A.15), it implies for $\mu \le 2/3$ that

$$-2\log\Bigl\{\tfrac{1}{2}\mathbb{P}\bigl(\|B\xi\|^2 > p_0 + u,\ \|B\xi\| \le y_c\bigr)\Bigr\} \ge \mu(p_0 + u) + \sum_{i=1}^p\log(1 - \mu a_i) \ge \mu(p_0 + u) - \sum_{i=1}^p(\mu a_i + \mu^2a_i^2) \ge \mu u - \mu^2v^2/2. \qquad (A.17)$$

Now we distinguish between $\mu_c = 2/3$ and $\mu_c < 2/3$, starting with $\mu_c = 2/3$. The bound (A.17) with $\mu = 2/3$ and with $u = (2vx^{1/2})\vee(6x)$ yields

$$\mathbb{P}\bigl(\|B\xi\|^2 > p_0 + u,\ \|B\xi\| \le y_c\bigr) \le 2\exp(-x);$$

see the proof of Theorem A.2 for the Gaussian case.

Now consider $\mu_c < 2/3$. For $x^{1/2} \le \mu_cv/2$, use $u = 2vx^{1/2}$ and $\mu_0 = u/v^2$. It holds $\mu_0 = u/v^2 \le \mu_c$ and $u^2/(4v^2) = x$ yielding the desired bound by (A.17). For $x^{1/2} > \mu_cv/2$, we select again $\mu_0 = \mu_c$. It holds with $u = 4\mu_c^{-1}x$ that $\mu_cu/2 - \mu_c^2v^2/4 \ge 2x - x = x$. This completes the proof. □

Now we describe the value $\mathfrak{z}(x,B)$ ensuring a small value for the large deviation probability $\mathbb{P}(\|B\xi\|^2 > \mathfrak{z}(x,B))$. For ease of formulation, we suppose that $g^2 \ge 2p_0$ yielding $\mu_c^{-1} \le 3/2$. The other case can be easily adjusted.

Corollary A.12. Let $\xi$ fulfill (A.1) with $g^2 \ge 2p_0$. Then it holds for $x \le x_c$ with $x_c$ from (A.12):

$$\mathbb{P}\bigl(\|B\xi\|^2 \ge \mathfrak{z}(x,B)\bigr) \le 2e^{-x} + 8.4e^{-x_c},$$

$$\mathfrak{z}(x,B) := \begin{cases}p_0 + 2vx^{1/2}, & x \le v/18,\\ p_0 + 6x, & v/18 < x \le x_c.\end{cases} \qquad (A.18)$$

For $x > x_c$

$$\mathbb{P}\bigl(\|B\xi\|^2 \ge \mathfrak{z}_c(x,B)\bigr) \le 8.4e^{-x}, \qquad \mathfrak{z}_c(x,B) := \bigl|y_c + 2(x - x_c)/g_c\bigr|^2.$$
A.4 Rescaling and regularity condition

The result of Theorem A.8 can be extended to a more general situation when the condition (A.1) is fulfilled for a vector $\zeta$ rescaled by a matrix $V_0$. More precisely, let the random $p$-vector $\zeta$ fulfill for some $p\times p$ matrix $V_0$ the condition

$$\sup_{\gamma\in\mathbb{R}^p}\log\mathbb{E}\exp\Bigl(\lambda\frac{\gamma^\top\zeta}{\|V_0\gamma\|}\Bigr) \le \nu_0^2\lambda^2/2, \qquad |\lambda| \le g, \qquad (A.19)$$

with some constants $g > 0$, $\nu_0 \ge 1$. Again, a simple change of variables reduces the case of an arbitrary $\nu_0 \ge 1$ to $\nu_0 = 1$. Our aim is to bound the squared norm of a vector $D_0^{-1}\zeta$ for another $p\times p$ positive symmetric matrix $D_0^2$. Note that condition (A.19) implies (A.1) for the rescaled vector $\xi = V_0^{-1}\zeta$. This leads to bounding the quadratic form $\|D_0^{-1}V_0\xi\|^2 = \|B\xi\|^2$ with $B^2 = D_0^{-1}V_0^2D_0^{-1}$. It obviously holds

$$p_0 = \operatorname{tr}(B^2) = \operatorname{tr}(D_0^{-2}V_0^2).$$

Now we can apply the result of Corollary A.12.

Corollary A.13. Let $\zeta$ fulfill (A.19) with some $V_0$ and $g$. Given $D_0$, define $B^2 = D_0^{-1}V_0^2D_0^{-1}$, and let $g^2 \ge 2p_0$. Then it holds for $x \le x_c$ with $x_c$ from (A.12):

$$\mathbb{P}\bigl(\|D_0^{-1}\zeta\|^2 \ge \mathfrak{z}(x,B)\bigr) \le 2e^{-x} + 8.4e^{-x_c},$$

with $\mathfrak{z}(x,B)$ from (A.18). For $x > x_c$

$$\mathbb{P}\bigl(\|D_0^{-1}\zeta\|^2 \ge \mathfrak{z}_c(x,B)\bigr) \le 8.4e^{-x}, \qquad \mathfrak{z}_c(x,B) := \bigl|y_c + 2(x - x_c)/g_c\bigr|^2.$$

Finally we briefly discuss the regular case with $D_0 \ge aV_0$ for some $a > 0$. This implies $\|B\|_\infty \le a^{-1}$ and

$$v^2 = 2\operatorname{tr}(B^4) \le 2a^{-2}p_0.$$
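A.5 A chi-squared bound with norm-constraints

This section extends the results to the case when the bound (A.1) requires some other conditions than the $\ell_2$-norm of the vector $\gamma$. Namely, we suppose that

$$\log\mathbb{E}\exp(\gamma^\top\xi) \le \|\gamma\|^2/2, \qquad \gamma\in\mathbb{R}^p,\ \|\gamma\|_\circ \le g_\circ, \qquad (A.20)$$

where $\|\cdot\|_\circ$ is a norm which differs from the usual Euclidean norm. Our driving example is given by the sup-norm case with $\|\gamma\|_\circ = \|\gamma\|_\infty$. We are interested in checking whether the previous results of Section A.2 still apply. The answer depends on how massive the set $A(r) = \{\gamma : \|\gamma\|_\circ \le r\}$ is in terms of the standard Gaussian measure on $\mathbb{R}^p$. Recall that the quadratic norm $\|\varepsilon\|^2$ of a standard Gaussian vector $\varepsilon$ in $\mathbb{R}^p$ concentrates around $p$, at least for $p$ large. We need a similar concentration property for the norm $\|\cdot\|_\circ$. More precisely, we assume for a fixed $r_*$ that

$$\mathbb{P}(\|\varepsilon\|_\circ \le r_*) \ge 1/2, \qquad \varepsilon\sim\mathcal{N}(0,I_p). \qquad (A.21)$$

This implies for any value $u_\circ > 0$ and all $u\in\mathbb{R}^p$ with $\|u\|_\circ \le u_\circ$ that

$$\mathbb{P}(\|\varepsilon - u\|_\circ \le r_* + u_\circ) \ge 1/2, \qquad \varepsilon\sim\mathcal{N}(0,I_p).$$

For each $\mathfrak{z} > p$, consider

$$\mu(\mathfrak{z}) = (\mathfrak{z} - p)/\mathfrak{z}.$$

Given $u_\circ$, denote by $\mathfrak{z}_\circ = \mathfrak{z}_\circ(u_\circ)$ the root of the equation

$$\frac{g_\circ}{\mu(\mathfrak{z}_\circ)} - \frac{r_*}{\mu^{1/2}(\mathfrak{z}_\circ)} = u_\circ. \qquad (A.22)$$

One can easily see that this value exists and is unique if $u_\circ \ge g_\circ - r_*$, and it can be defined as the largest $\mathfrak{z}$ for which $g_\circ/\mu(\mathfrak{z}) - r_*/\mu^{1/2}(\mathfrak{z}) \ge u_\circ$. Let $\mu_\circ = \mu(\mathfrak{z}_\circ)$ be the corresponding $\mu$-value. Define also $x_\circ$ by

$$2x_\circ = \mu_\circ\mathfrak{z}_\circ + p\log(1 - \mu_\circ).$$

If $u_\circ < g_\circ - r_*$, then set $\mathfrak{z}_\circ = \infty$, $x_\circ = \infty$.

Theorem A.14. Let a random vector $\xi$ in $\mathbb{R}^p$ fulfill (A.20). Suppose (A.21) and let, given $u_\circ$, the value $\mathfrak{z}_\circ$ be defined by (A.22). Then it holds for any $u > 0$

$$\mathbb{P}\bigl(\|\xi\|^2 > p + u,\ \|\xi\|_\circ \le u_\circ\bigr) \le 2\exp\{-(p/2)\phi(u/p)\}, \qquad (A.23)$$

yielding for $x \le x_\circ$

$$\mathbb{P}\bigl(\|\xi\|^2 > p + \sqrt{\varkappa xp}\vee(\varkappa x),\ \|\xi\|_\circ \le u_\circ\bigr) \le 2\exp(-x), \qquad (A.24)$$

where $\varkappa = 6.6$. Moreover, for $\mathfrak{z} \ge \mathfrak{z}_\circ$, it holds

$$\mathbb{P}\bigl(\|\xi\|^2 > \mathfrak{z},\ \|\xi\|_\circ \le u_\circ\bigr) \le 2\exp\{-\mu_\circ\mathfrak{z}/2 - (p/2)\log(1 - \mu_\circ)\} = 2\exp\{-x_\circ - \mu_\circ(\mathfrak{z} - \mathfrak{z}_\circ)/2\}.$$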

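Proof. The arguments behind the result are the same as in the one-norm case of Theorem A.4. We only outline the main steps.

Lemma A.15. Suppose (A.20) and (A.21). For any $\mu < 1$ with $g_\circ > \mu^{1/2}r_*$, it holds

$$\mathbb{E}\exp\bigl(\mu\|\xi\|^2/2\bigr)\mathbb{1}\bigl(\|\xi\|_\circ \le g_\circ/\mu - r_*/\mu^{1/2}\bigr) \le 2(1 - \mu)^{-p/2}. \qquad (A.25)$$

Proof. Let $\varepsilon$ be a standard normal vector in $\mathbb{R}^p$ and $u\in\mathbb{R}^p$. Let us fix some $\xi$ with $\mu^{1/2}\|\xi\|_\circ \le \mu^{-1/2}g_\circ - r_*$ and denote by $\mathbb{P}_\xi$ the conditional probability given $\xi$. It holds by (A.21) with $c_p = (2\pi)^{-p/2}$

$$c_p\int\exp\Bigl(\gamma^\top\xi - \frac{\|\gamma\|^2}{2\mu}\Bigr)\mathbb{1}(\|\gamma\|_\circ \le g_\circ)\,d\gamma = c_p\exp(\mu\|\xi\|^2/2)\int\exp\Bigl(-\frac{1}{2}\bigl\|\mu^{1/2}\xi - \mu^{-1/2}\gamma\bigr\|^2\Bigr)\mathbb{1}\bigl(\|\mu^{-1/2}\gamma\|_\circ \le \mu^{-1/2}g_\circ\bigr)\,d\gamma$$

$$= \mu^{p/2}\exp(\mu\|\xi\|^2/2)\,\mathbb{P}_\xi\bigl(\|\varepsilon + \mu^{1/2}\xi\|_\circ \le \mu^{-1/2}g_\circ\bigr) \ge 0.5\,\mu^{p/2}\exp(\mu\|\xi\|^2/2).$$

This implies

$$\exp\bigl(\mu\|\xi\|^2/2\bigr)\mathbb{1}\bigl(\|\xi\|_\circ \le g_\circ/\mu - r_*/\mu^{1/2}\bigr) \le 2\mu^{-p/2}c_p\int\exp\Bigl(\gamma^\top\xi - \frac{\|\gamma\|^2}{2\mu}\Bigr)\mathbb{1}(\|\gamma\|_\circ \le g_\circ)\,d\gamma.$$

Further, by (A.20)

$$c_p\,\mathbb{E}\int\exp\Bigl(\gamma^\top\xi - \frac{\|\gamma\|^2}{2\mu}\Bigr)\mathbb{1}(\|\gamma\|_\circ \le g_\circ)\,d\gamma \le c_p\int\exp\Bigl(-\frac{\mu^{-1} - 1}{2}\|\gamma\|^2\Bigr)d\gamma \le (\mu^{-1} - 1)^{-p/2}$$

and (A.25) follows. □

As in the Gaussian case, (A.25) implies for $\mathfrak{z} > p$ with $\mu = \mu(\mathfrak{z}) = (\mathfrak{z} - p)/\mathfrak{z}$ the bounds (A.23) and (A.24). Note that the value $\mu(\mathfrak{z})$ clearly grows with $\mathfrak{z}$ from zero to one, while $g_\circ/\mu(\mathfrak{z}) - r_*/\mu^{1/2}(\mathfrak{z})$ is strictly decreasing. The value $\mathfrak{z}_\circ$ is defined exactly as the point where $g_\circ/\mu(\mathfrak{z}) - r_*/\mu^{1/2}(\mathfrak{z})$ crosses $u_\circ$, so that $g_\circ/\mu(\mathfrak{z}) - r_*/\mu^{1/2}(\mathfrak{z}) \ge u_\circ$ for $\mathfrak{z} \le \mathfrak{z}_\circ$.

For $\mathfrak{z} > \mathfrak{z}_\circ$, the choice $\mu = \mu(\mathfrak{z})$ conflicts with $g_\circ/\mu(\mathfrak{z}) - r_*/\mu^{1/2}(\mathfrak{z}) \ge u_\circ$. So, we apply $\mu = \mu_\circ$ yielding by the Markov inequality

$$\mathbb{P}\bigl(\|\xi\|^2 > \mathfrak{z},\ \|\xi\|_\circ \le u_\circ\bigr) \le 2\exp\{-\mu_\circ\mathfrak{z}/2 - (p/2)\log(1 - \mu_\circ)\},$$

and the assertion follows. □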

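It is easy to check that the result continues to hold for the norm of $\Pi\xi$ for a given sub-projector $\Pi$ in $\mathbb{R}^p$ satisfying $\Pi = \Pi^\top$, $\Pi^2 \le \Pi$. As above, denote $p_0 := \operatorname{tr}(\Pi^2)$, $v^2 := 2\operatorname{tr}(\Pi^4)$. Let $r_*$ be fixed to ensure

$$\mathbb{P}(\|\Pi\varepsilon\|_\circ \le r_*) \ge 1/2, \qquad \varepsilon\sim\mathcal{N}(0,I_p).$$

The next result is stated for $g_\circ \ge r_* + u_\circ$, which simplifies the formulation.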

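Theorem A.16. Let a random vector $\xi$ in $\mathbb{R}^p$ fulfill (A.20) and let $\Pi$ satisfy $\Pi = \Pi^\top$, $\Pi^2 \le \Pi$. Let some $u_\circ$ be fixed. Then for any $\mu_\circ \le 2/3$ with $g_\circ\mu_\circ^{-1} - r_*\mu_\circ^{-1/2} \ge u_\circ$,

$$\mathbb{E}\exp\Bigl\{\frac{\mu_\circ}{2}\bigl(\|\Pi\xi\|^2 - p_0\bigr)\Bigr\}\mathbb{1}\bigl(\|\Pi^2\xi\|_\circ \le u_\circ\bigr) \le 2\exp(\mu_\circ^2v^2/4), \qquad (A.26)$$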

where = 2tr(7Z"^) . Moreover, if go > + Uo , then for any 3 > 

iP(||7r^f >3,||7r2^||o <uo) 

< ]P{\\n^f > po + (2vxi/2) V (6x), ||772^||o < Uo) < 2exp(-x). 

Proof. Arguments from the proof of Lemmas A.9 and A.15 yield in view of $g_\circ\mu_\circ^{-1} - r_*\mu_\circ^{-1/2} \ge u_\circ$

$$\mathbb{E}\exp\{\mu_\circ\|\Pi\xi\|^2/2\}\mathbb{1}\bigl(\|\Pi^2\xi\|_\circ \le u_\circ\bigr) \le \mathbb{E}\exp\bigl(\mu_\circ\|\Pi\xi\|^2/2\bigr)\mathbb{1}\bigl(\|\Pi^2\xi\|_\circ \le g_\circ/\mu_\circ - r_*/\mu_\circ^{1/2}\bigr) \le 2\det(I_p - \mu_\circ\Pi^2)^{-1/2}.$$

Now the inequality $\log(1 - t) \ge -t - t^2$ for $t \le 2/3$ implies

$$-\log\det(I_p - \mu_\circ\Pi^2) \le \mu_\circ p_0 + \mu_\circ^2v^2/2;$$

cf. (A.17); the assertion (A.26) follows. □

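A.6 A bound for the $\ell_2$-norm under Bernstein conditions

For comparison, we specify the results to the case considered recently in Y. Baraud (2010). Let $\zeta$ be a random vector in $\mathbb{R}^n$ whose components $\zeta_i$ are independent and satisfy the Bernstein type conditions: for all $|\lambda| < c^{-1}$

$$\log\mathbb{E}e^{\lambda\zeta_i} \le \frac{\lambda^2\sigma^2}{1 - c|\lambda|}. \qquad (A.27)$$

Denote $\xi = \zeta/(2\sigma)$ and consider $\|\gamma\|_\circ = \|\gamma\|_\infty$. Fix $g_\circ = \sigma/c$. If $\|\gamma\|_\circ \le g_\circ$, then $1 - c\gamma_i/(2\sigma) \ge 1/2$ and

$$\log\mathbb{E}\exp(\gamma^\top\xi) \le \sum_i\log\mathbb{E}\exp\Bigl(\frac{\gamma_i\zeta_i}{2\sigma}\Bigr) \le \sum_i\frac{|\gamma_i/(2\sigma)|^2\sigma^2}{1 - c\gamma_i/(2\sigma)} \le \|\gamma\|^2/2.$$

Let also $S$ be some linear subspace of $\mathbb{R}^n$ with dimension $p_0$ and let $\Pi_S$ denote the projector on $S$. For applying the result of Theorem A.14, the value $r_*$ has to be fixed. We use that the infinity norm $\|\varepsilon\|_\infty$ concentrates around $\sqrt{2\log p}$.

Lemma A.17. It holds for a standard normal vector $\varepsilon \in \mathbb{R}^p$ with $r_* = \sqrt{2\log p}$

$$\mathbb{P}(\|\varepsilon\|_\circ \le r_*) \ge 1/2.$$

Proof. By definition

$$\mathbb{P}(\|\varepsilon\|_\circ > r_*) \le \mathbb{P}\bigl(\|\varepsilon\|_\infty > \sqrt{2\log p}\bigr) \le p\,\mathbb{P}\bigl(|\varepsilon_1| > \sqrt{2\log p}\bigr) \le 1/2$$

as required. □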
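The last inequality in the proof of Lemma A.17 can be verified directly with the complementary error function, since $\mathbb{P}(|\varepsilon_1| > t) = \operatorname{erfc}(t/\sqrt{2})$ for a standard normal $\varepsilon_1$. The sketch below evaluates the union bound $p\,\mathbb{P}(|\varepsilon_1| > \sqrt{2\log p})$. The function name is ours, and the check is meant for $p \ge 2$ (for $p = 1$ the radius $\sqrt{2\log p}$ is zero).

```python
import math


def union_bound(p):
    """p * P(|eps_1| > sqrt(2 log p)) for a standard normal eps_1."""
    r_star = math.sqrt(2.0 * math.log(p))
    return p * math.erfc(r_star / math.sqrt(2.0))
```

Evaluating `union_bound(p)` for a range of dimensions shows that the bound stays below $1/2$ and decreases slowly as $p$ grows.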

Now the general bound of Theorem A.14 is applied to bounding the norm $\|\Pi_S\xi\|$. For simplicity of formulation we assume that $g_\circ \ge u_\circ + r_*$.

Theorem A.18. Let $S$ be some linear subspace of $\mathbb{R}^n$ with dimension $p_0$. Let $g_\circ \ge u_\circ + r_*$. If the coordinates $\zeta_i$ of $\zeta$ are independent and satisfy (A.27), then for all $x$,

$$\mathbb{P}\bigl((4\sigma^2)^{-1}\|\Pi_S\zeta\|^2 > p_0 + \sqrt{\varkappa xp_0}\vee(\varkappa x),\ \|\Pi_S\zeta\|_\infty \le 2\sigma u_\circ\bigr) \le 2\exp(-x).$$

The bound of Baraud (2010) reads 

Ip(^\\nsCh > (3^ V ^/6^)v/x + 3po, IliT^CIloo < 2(JUo^ < e-^ 

As expected, in the region $x \le x_c$ of Gaussian approximation, the bound of Baraud is not sharp and actually quite rough.

B Some results for empirical processes

This section presents some general results of the theory of empirical processes. We assume some exponential moment conditions on the increments of the process, which allow us to apply the well developed chaining arguments in Orlicz spaces; see e.g. van der Vaart and Wellner (1996), Chapter 2.2. We, however, follow the more recent approach inspired by the notions of generic chaining and majorizing measures due to M. Talagrand; see e.g. Talagrand (1996, 2001, 2005). The results are close to those of Bednorz (2006). We state the results in a slightly different form and present an independent and self-contained proof.

The first result states a bound for local fluctuations of the process $\mathcal{U}(\upsilon)$ given on a metric space $\Upsilon$. Then this result will be used for bounding the maximum of the negatively drifted process $\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon_0) - \rho d^2(\upsilon,\upsilon_0)$ over a vicinity $\Upsilon_0(r)$ of the central point $\upsilon_0$. The behavior of $\mathcal{U}(\upsilon)$ outside of the local central set $\Upsilon_0(r)$ is described using the upper function method. Namely, we construct a multiscale deterministic function $u(\mu,\upsilon)$ ensuring that with probability at least $1 - e^{-x}$ it holds $\mu\,\mathcal{U}(\upsilon) + u(\mu,\upsilon) \le \mathfrak{z}(x)$ for all $\upsilon \notin \Upsilon_0(r)$ and $\mu \in M$, where $\mathfrak{z}(x)$ grows linearly in $x$.

B.l A bound for local fluctuations 

An important step in the whole construction is an exponential bound on the maximum of a random process $\mathcal{U}(\upsilon)$ under the exponential moment conditions on its increments. Let $d(\upsilon,\upsilon')$ be a semi-distance on $\Upsilon$. We suppose the following condition to hold:

$(\mathcal{E}d)$ There exist $g > 0$, $r_0 > 0$, $\nu_0 \ge 1$, such that for any $\lambda \le g$ and $\upsilon,\upsilon' \in \Upsilon$ with $d(\upsilon,\upsilon') \le r_0$



Formulation of the result involves a sigma-finite measure $\pi$ on the space $\Upsilon$ which is often called the majorizing measure and is used in the generic chaining device; see Talagrand (2005). A typical example of choosing $\pi$ is the Lebesgue measure on $\mathbb{R}^p$. Let $\Upsilon^\circ$ be a subset of $\Upsilon$ and let a sequence $r_k$ be fixed with $r_0 = \operatorname{diam}(\Upsilon^\circ)$ and $r_k = r_02^{-k}$. Let also $B_k(\upsilon) := \{\upsilon' \in \Upsilon^\circ : d(\upsilon,\upsilon') \le r_k\}$ be the $d$-ball centered at $\upsilon$ of radius $r_k$ and let $\pi_k(\upsilon)$ denote its $\pi$-measure:



(B.l) 
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Denote also 



Mk =^ max k>l. (B.2) 

ver° TTkiv) 



Finally set ci = 1/3, Ck = 2~''+^/3 for k>2, and define the value Q(T°) by 

OO /I °° 



3 

fc=l k=2 



Theorem B.l. Let XL be a separable process following to {£,d) . If T° is a d-ball in 
T with the center v° and the radius tq , i.e. d{v,v°) < tq for all v & T° , then for 
A < go = fog 

logiEexpl-^ sup |U(t^) < AV2 + Q(r°). (B.3) 

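Proof. Replacing $\mathcal{U}(\cdot)$ with $\nu_0^{-1} \mathcal{U}(\cdot)$ and $g$ with $g_0 = \nu_0 g$ reduces the result to the case with $\nu_0 = 1$, which we assume below. Consider for $k \ge 1$ the smoothing operator $S_k$ defined as
\[
S_k f(\upsilon) := \frac{1}{\pi_k(\upsilon)} \int_{\mathcal{B}_k(\upsilon)} f(\upsilon')\, \pi(d\upsilon').
\]
Further, define
\[
S_0\, \mathcal{U}(\upsilon) \equiv \mathcal{U}(\upsilon^\circ)
\]
so that $S_0\, \mathcal{U}$ is a constant function and the same holds for $S_k S_{k-1} \cdots S_0\, \mathcal{U}$ with any $k \ge 1$. If $f(\cdot) \le g(\cdot)$ for two non-negative functions $f$ and $g$, then $S_k f(\cdot) \le S_k g(\cdot)$. Separability of the process $\mathcal{U}$ implies that $\lim_k S_k\, \mathcal{U}(\upsilon) = \mathcal{U}(\upsilon)$. We conclude that for each $\upsilon \in \Upsilon^\circ$
\[
\bigl| \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ) \bigr|
= \lim_{k \to \infty} \bigl| S_k\, \mathcal{U}(\upsilon) - S_k \cdots S_0\, \mathcal{U}(\upsilon) \bigr|
\le \lim_{k \to \infty} \sum_{i=1}^{k} \bigl| S_k \cdots S_i (I - S_{i-1})\, \mathcal{U}(\upsilon) \bigr|
\le \sum_{i=1}^{\infty} \xi_i^*.
\]
Here $\xi_k^* := \sup_{\upsilon \in \Upsilon^\circ} \xi_k(\upsilon)$ for $k \ge 1$ with
\[
\xi_1(\upsilon) := \bigl| S_1\, \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ) \bigr|, \qquad
\xi_k(\upsilon) := \bigl| S_k (I - S_{k-1})\, \mathcal{U}(\upsilon) \bigr|, \quad k \ge 2.
\]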

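For a fixed point $\upsilon^\sharp$, it holds
\[
\xi_k(\upsilon^\sharp) \le \frac{1}{\pi_k(\upsilon^\sharp)} \int_{\mathcal{B}_k(\upsilon^\sharp)} \frac{1}{\pi_{k-1}(\upsilon)} \int_{\mathcal{B}_{k-1}(\upsilon)} \bigl| \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon') \bigr|\, \pi(d\upsilon')\, \pi(d\upsilon).
\]
For each $\upsilon' \in \mathcal{B}_{k-1}(\upsilon)$, it holds $d(\upsilon,\upsilon') \le r_{k-1} = 2 r_k$ and
\[
\bigl| \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon') \bigr| \le r_{k-1}\, \frac{\bigl| \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon') \bigr|}{d(\upsilon,\upsilon')}.
\]
This implies for each $\upsilon^\sharp \in \Upsilon^\circ$ and $k \ge 2$ by the Jensen inequality and (B.2)
\[
\exp\Bigl\{ \frac{\lambda}{r_{k-1}}\, \xi_k(\upsilon^\sharp) \Bigr\}
\le \int_{\mathcal{B}_k(\upsilon^\sharp)} \Bigl( \int_{\mathcal{B}_{k-1}(\upsilon)} \exp\Bigl\{ \lambda \frac{| \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon') |}{d(\upsilon,\upsilon')} \Bigr\} \frac{\pi(d\upsilon')}{\pi_{k-1}(\upsilon)} \Bigr) \frac{\pi(d\upsilon)}{\pi_k(\upsilon^\sharp)}
\le M_k \int_{\Upsilon^\circ} \Bigl( \int_{\mathcal{B}_{k-1}(\upsilon)} \exp\Bigl\{ \lambda \frac{| \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon') |}{d(\upsilon,\upsilon')} \Bigr\} \frac{\pi(d\upsilon')}{\pi_{k-1}(\upsilon)} \Bigr) \frac{\pi(d\upsilon)}{\pi(\Upsilon^\circ)}.
\]
As the right-hand side does not depend on $\upsilon^\sharp$, this yields for $\xi_k^* := \sup_{\upsilon \in \Upsilon^\circ} \xi_k(\upsilon)$ by condition $(\mathcal{E}d)$ in view of $e^{|x|} \le e^{x} + e^{-x}$
\[
\mathbb{E} \exp\Bigl\{ \frac{\lambda}{r_{k-1}}\, \xi_k^* \Bigr\}
\le M_k \int_{\Upsilon^\circ} \Bigl( \int_{\mathcal{B}_{k-1}(\upsilon)} \mathbb{E} \exp\Bigl\{ \lambda \frac{| \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon') |}{d(\upsilon,\upsilon')} \Bigr\} \frac{\pi(d\upsilon')}{\pi_{k-1}(\upsilon)} \Bigr) \frac{\pi(d\upsilon)}{\pi(\Upsilon^\circ)}
\le 2 M_k \exp(\lambda^2/2) \int_{\Upsilon^\circ} \Bigl( \int_{\mathcal{B}_{k-1}(\upsilon)} \frac{\pi(d\upsilon')}{\pi_{k-1}(\upsilon)} \Bigr) \frac{\pi(d\upsilon)}{\pi(\Upsilon^\circ)}
= 2 M_k \exp(\lambda^2/2).
\]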



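Further, the use of $d(\upsilon,\upsilon^\circ) \le r_0$ for all $\upsilon \in \Upsilon^\circ$ yields by $(\mathcal{E}d)$
\[
\mathbb{E} \exp\Bigl\{ \frac{\lambda}{r_0} \bigl| \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ) \bigr| \Bigr\} \le 2 \exp(\lambda^2/2) \tag{B.4}
\]
and thus
\[
\mathbb{E} \exp\Bigl\{ \frac{\lambda}{r_0} \bigl| S_1\, \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ) \bigr| \Bigr\}
\le \frac{1}{\pi_1(\upsilon)} \int_{\mathcal{B}_1(\upsilon)} \mathbb{E} \exp\Bigl\{ \frac{\lambda}{r_0} \bigl| \mathcal{U}(\upsilon') - \mathcal{U}(\upsilon^\circ) \bigr| \Bigr\}\, \pi(d\upsilon')
\le \frac{M_1}{\pi(\Upsilon^\circ)} \int_{\Upsilon^\circ} \mathbb{E} \exp\Bigl\{ \frac{\lambda}{r_0} \bigl| \mathcal{U}(\upsilon') - \mathcal{U}(\upsilon^\circ) \bigr| \Bigr\}\, \pi(d\upsilon').
\]
This implies by (B.4) for $\xi_1^* := \sup_{\upsilon \in \Upsilon^\circ} | S_1\, \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ) |$
\[
\mathbb{E} \exp\Bigl\{ \frac{\lambda}{r_0}\, \xi_1^* \Bigr\} \le 2 M_1 \exp(\lambda^2/2).
\]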

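Denote $c_1 = 1/3$ and $c_k = r_{k-1}/(3 r_0) = 2^{-k+2}/3$ for $k \ge 2$. Then $\sum_{k=1}^{\infty} c_k = 1$ and it holds by the Hölder inequality; see Lemma B.16 below:
\[
\log \mathbb{E} \exp\Bigl\{ \frac{\lambda}{3 r_0} \sum_{k=1}^{\infty} \xi_k^* \Bigr\}
\le c_1 \log \mathbb{E} \exp\Bigl\{ \frac{\lambda}{r_0}\, \xi_1^* \Bigr\} + \sum_{k=2}^{\infty} c_k \log \mathbb{E} \exp\Bigl\{ \frac{\lambda}{r_{k-1}}\, \xi_k^* \Bigr\}
\le \lambda^2/2 + c_1 \log(2 M_1) + \sum_{k=2}^{\infty} c_k \log(2 M_k)
= \lambda^2/2 + Q(\Upsilon^\circ).
\]
This implies the result. □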

The exponential bound of Theorem B.1 can be used for obtaining a probability bound on the maximum of the increments $\mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)$ over $\Upsilon^\circ$.

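Corollary B.2. Suppose $(\mathcal{E}d)$. If $\Upsilon^\circ$ is a central set with the center $\upsilon^\circ$ and the radius $r_0$, then it holds for any $x > 0$
\[
\mathbb{P}\Bigl( \frac{1}{3 \nu_0 r_0} \sup_{\upsilon \in \Upsilon^\circ} \mathcal{U}(\upsilon,\upsilon^\circ) \ge \mathfrak{z}(x,Q) \Bigr) \le \exp(-x), \tag{B.5}
\]
where $\mathcal{U}(\upsilon,\upsilon^\circ) = \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)$ and, with $g_0 = \nu_0 g$ and $Q = Q(\Upsilon^\circ)$,
\[
\mathfrak{z}(x,Q) :=
\begin{cases}
\sqrt{2(x+Q)} & \text{if } \sqrt{2(x+Q)} \le g_0, \\
g_0^{-1}(x+Q) + g_0/2 & \text{otherwise.}
\end{cases}
\tag{B.6}
\]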



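Proof. By the exponential Chebyshev inequality, it holds for the r.v. $\xi := \sup_{\upsilon \in \Upsilon^\circ} \mathcal{U}(\upsilon,\upsilon^\circ)/(3 \nu_0 r_0)$ and any $\lambda \le g_0$ by (B.3)
\[
\log \mathbb{P}(\xi > \mathfrak{z}) \le -\lambda \mathfrak{z} + \log \mathbb{E} \exp\{\lambda \xi\} \le -\lambda \mathfrak{z} + \lambda^2/2 + Q.
\]
Now, given $x > 0$, we choose $\lambda = \sqrt{2(x+Q)}$ if this value is not larger than $g_0$, and $\lambda = g_0$ otherwise. It is straightforward to check that $\lambda \mathfrak{z} - \lambda^2/2 - Q \ge x$ in both cases, and the choice of $\mathfrak{z}$ by (B.6) yields the bound (B.5). □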
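The two-regime form of (B.6) is easy to evaluate in practice. The following is a minimal sketch, assuming the reading of (B.6) given above; the function names `z_bound` and `chebyshev_exponent` are illustrative and not part of the paper. It also evaluates the quantity $\lambda \mathfrak{z} - \lambda^2/2 - Q$ for the $\lambda$ chosen in the proof, which should reproduce $x$ in both regimes.

```python
import math


def z_bound(x: float, q: float, g0: float) -> float:
    """Deviation level z(x, Q) as read from (B.6)."""
    s = math.sqrt(2.0 * (x + q))
    if s <= g0:
        # Sub-Gaussian regime: the optimal lambda is admissible.
        return s
    # Otherwise lambda is capped at g0 and the level grows linearly in x + Q.
    return (x + q) / g0 + g0 / 2.0


def chebyshev_exponent(x: float, q: float, g0: float) -> float:
    """Value of lam * z - lam^2 / 2 - Q for the lam used in the proof."""
    z = z_bound(x, q, g0)
    lam = min(math.sqrt(2.0 * (x + q)), g0)
    return lam * z - lam ** 2 / 2.0 - q


if __name__ == "__main__":
    # Hypothetical values of (x, Q, g0), chosen to hit both regimes.
    for x, q, g0 in [(1.0, 3.0, 5.0), (4.0, 10.0, 2.0), (0.5, 0.5, 1.0)]:
        print(x, q, g0, z_bound(x, q, g0), chebyshev_exponent(x, q, g0))
```

In both branches the exponent equals $x$ up to rounding, which is exactly what the proof needs for the bound $e^{-x}$.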



B.2 Application to a two-norms case

As an application of the local bound from Theorem B.1 we consider the result from Baraud (2009), Theorem 3. For convenience of comparison we utilize the notation from that paper. Let $T$ be a subset of a linear space $S$ of dimension $D$, endowed with two norms denoted by $d(s,t)$ and $\delta(s,t)$ for $s,t \in T$. Let also $(X_t)_{t \in T}$ be a random process on $T$. The basic assumption of Baraud (2009) is a kind of Bernstein bound: for some fixed $c > 0$
\[
\log \mathbb{E} \exp\{\lambda (X_t - X_s)\} \le \frac{\lambda^2 d^2(s,t)/2}{1 - \lambda c\, \delta(s,t)}, \qquad \text{if } \lambda c\, \delta(s,t) < 1. \tag{B.7}
\]
The aim is to bound the maximum of the process $X_t$ over a bounded subset $T_{v,b}$ defined for $v, b > 0$ and a specific point $t_0$ as
\[
T_{v,b} := \{ t : d(t,t_0) \le v, \; c\, \delta(t,t_0) \le b \}.
\]
Let $Q = c_1 D$ with $c_1 = 2$ for $D \ge 2$ and $c_1 = 2.7$ for $D = 1$.

Theorem B.3. Suppose that $(X_t)_{t \in S}$ fulfills (B.7), where $S$ is a $D$-dimensional linear space. For any $\rho < 1$, it holds
\[
\log \mathbb{P}\Bigl( \frac{\sqrt{1-\rho}}{3 v} \sup_{t \in T_{v,b}} (X_t - X_{t_0}) \ge \mathfrak{z}(x,Q) \Bigr) \le -x, \tag{B.8}
\]
where $\mathfrak{z}(x,Q)$ is from (B.6) with $g_0 = \rho (1-\rho)^{-1/2}\, b^{-1} v$.
Proof. Define the new semi-distance $d^*(s,t)$ by
\[
d^*(s,t) = \max\bigl\{ d(s,t), \; b^{-1} v\, c\, \delta(s,t) \bigr\}.
\]
The set $T_{v,b}$ can be represented as
\[
T_{v,b} = \{ t : d^*(t,t_0) \le v \}.
\]
Moreover, Lemma B.10 applied for the semi-distance $d^*(t,s)$ yields $Q(T_{v,b}) \le c_1 D$, where $c_1 = 2$ for $D \ge 2$, and $c_1 = 2.7$ for $D = 1$.

Fix some $\rho < 1$ and define $g = \rho\, b^{-1} v$. Then for $|\lambda| \le g$, it holds
\[
\frac{\lambda}{d^*(s,t)}\, c\, \delta(s,t) \le \frac{\rho\, b^{-1} v}{b^{-1} v\, c\, \delta(s,t)}\, c\, \delta(s,t) = \rho
\]
and by (B.7), applied with $\lambda / d^*(s,t)$ in place of $\lambda$, it follows with $\nu_0 = (1-\rho)^{-1/2}$
\[
\log \mathbb{E} \exp\Bigl\{ \lambda \frac{X_t - X_s}{d^*(s,t)} \Bigr\}
\le \frac{\lambda^2}{2} \Bigl( \frac{d(s,t)}{d^*(s,t)} \Bigr)^2 \frac{1}{1-\rho}
\le \frac{\lambda^2}{2(1-\rho)} = \frac{\nu_0^2 \lambda^2}{2}.
\]
So, condition $(\mathcal{E}d)$ is fulfilled. Now the result follows from Corollary B.2. □

If $v$ is large relative to $b$, then $g = \rho v / b$ is large as well. With moderate values of $x$, this allows for applying the bound (B.8) with $\mathfrak{z}(x,Q) = \sqrt{2(x+Q)}$. In other words, with $\mathfrak{z} = \mathfrak{z}(x,Q)$, the maximum of $X_t - X_{t_0}$ over $t \in T_{v,b}$ exceeds $3 \nu_0 v\, \mathfrak{z}$ only with the exponentially small probability $e^{-x}$.

B.3 A local central bound 

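Due to the result of Theorem B.1, the bound for the maximum of $\mathcal{U}(\upsilon,\upsilon_0)$ over $\upsilon \in \mathcal{B}_r(\upsilon_0)$ grows quadratically in $r$. So, its applications to situations with $r^2 \gg Q(\Upsilon^\circ)$ are limited. The next result shows that introducing a negative quadratic drift helps to state a local probability bound which is uniform in $r$. Namely, the bound for the process $\mathcal{U}(\upsilon,\upsilon_0) - \rho\, d^2(\upsilon,\upsilon_0)/2$ with some positive $\rho$ over a ball $\mathcal{B}_r(\upsilon_0)$ around the point $\upsilon_0$ only depends on the drift coefficient $\rho$ but not on $r$. Here the generic chaining arguments are combined with the slicing technique. The idea is for a given $r^* > 1$ to split the ball $\mathcal{B}_{r^*}(\upsilon_0)$ into the slices $\mathcal{B}_{r+1}(\upsilon_0) \setminus \mathcal{B}_r(\upsilon_0)$ and to apply Theorem B.1 to each slice separately with a proper choice of the parameter $\lambda$.

Theorem B.4. Let $r^*$ be such that $(\mathcal{E}d)$ holds on $\mathcal{B}_{r^*}(\upsilon_0)$. Let also $Q(\Upsilon^\circ) \le Q$ for $\Upsilon^\circ = \mathcal{B}_r(\upsilon_0)$ with $r \le r^*$. If $\rho > 0$ and $\mathfrak{z}$ are fixed to ensure $\sqrt{2 \rho \mathfrak{z}} \le g_0 = \nu_0 g$ and $\rho(\mathfrak{z} - 1) \ge 2$, then it holds
\[
\log \mathbb{P}\Bigl( \sup_{\upsilon \in \mathcal{B}_{r^*}(\upsilon_0)} \Bigl\{ \frac{1}{3 \nu_0}\, \mathcal{U}(\upsilon,\upsilon_0) - \frac{\rho}{2}\, d^2(\upsilon,\upsilon_0) \Bigr\} > \mathfrak{z} \Bigr)
\le -\rho(\mathfrak{z} - 1) + \log(4 \mathfrak{z}) + Q. \tag{B.9}
\]
Moreover, if $\sqrt{2 \rho \mathfrak{z}} > g_0$, then
\[
\log \mathbb{P}\Bigl( \sup_{\upsilon \in \mathcal{B}_{r^*}(\upsilon_0)} \Bigl\{ \frac{1}{3 \nu_0}\, \mathcal{U}(\upsilon,\upsilon_0) - \frac{\rho}{2}\, d^2(\upsilon,\upsilon_0) \Bigr\} > \mathfrak{z} \Bigr)
\le -g_0 \sqrt{\rho(\mathfrak{z} - 1)} + g_0^2/2 + \log(4 \mathfrak{z}) + Q. \tag{B.10}
\]

Remark B.1. Formally the bound applies even with $r^* = \infty$ provided that $(\mathcal{E}d)$ is fulfilled on the whole set $\Upsilon^\circ$.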


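Proof. Denote
\[
u(r) := \frac{1}{3 \nu_0 r} \sup_{\upsilon \in \mathcal{B}_r(\upsilon_0)} \bigl\{ \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon_0) \bigr\}.
\]
Then we have to bound the probability
\[
\mathbb{P}\Bigl( \sup_{r \le r^*} \bigl\{ r\, u(r) - \rho r^2/2 \bigr\} > \mathfrak{z} \Bigr).
\]
For each $r \le r^*$ and $\lambda \le g_0$, it follows from (B.3) that
\[
\log \mathbb{E} \exp\{\lambda u(r)\} \le \lambda^2/2 + Q.
\]
The choice $\lambda = \sqrt{2 \rho \mathfrak{z}}$ is admissible in view of $\sqrt{2 \rho \mathfrak{z}} \le g_0$. This implies by the exponential Chebyshev inequality
\[
\log \mathbb{P}\bigl( r\, u(r) - \rho r^2/2 \ge \mathfrak{z} \bigr) \le -\lambda (\mathfrak{z}/r + \rho r/2) + \lambda^2/2 + Q
= -\rho \mathfrak{z} (x + x^{-1} - 1) + Q, \tag{B.11}
\]
where $x = \sqrt{\rho/(2\mathfrak{z})}\; r$. We now apply the slicing arguments w.r.t. $t = \rho r^2/2 = \mathfrak{z} x^2$. By definition, $r\, u(r)$ increases in $r$. We use that for any growing function $f(\cdot)$ and any $t \ge 0$, it holds
\[
f(t) - t \le \int_t^{t+1} \{ f(s) - s + 1 \}\, ds.
\]
Therefore, for any $t^* > 0$, it holds by (B.11) in view of $dt = 2 \mathfrak{z} x\, dx$
\[
\mathbb{P}\Bigl( \sup_{r \le r^*} \bigl\{ r\, u(r) - \rho r^2/2 \bigr\} > \mathfrak{z} \Bigr)
\le \int_0^{t^*+1} \mathbb{P}\bigl\{ r\, u(r) - t \ge \mathfrak{z} - 1 \bigr\}\, dt
\le 2 \mathfrak{z} \int_0^{\infty} \exp\bigl\{ -\rho(\mathfrak{z} - 1)(x + x^{-1} - 1) + Q \bigr\}\, x\, dx
\le 2 \mathfrak{z}\, e^{-b+Q} \int_0^{\infty} \exp\bigl\{ -b (x + x^{-1} - 2) \bigr\}\, x\, dx
\]
with $b = \rho(\mathfrak{z} - 1)$ and $t^* = \rho r^{*2}/2$. This implies for $b \ge 2$
\[
\mathbb{P}\Bigl( \sup_{r \le r^*} \bigl\{ r\, u(r) - \rho r^2/2 \bigr\} > \mathfrak{z} \Bigr)
\le 2 \mathfrak{z}\, e^{-b+Q} \int_0^{\infty} \exp\bigl\{ -2 (x + x^{-1} - 2) \bigr\}\, x\, dx
\le 4 \mathfrak{z} \exp\bigl\{ -\rho(\mathfrak{z} - 1) + Q \bigr\}
\]
and (B.9) follows.

If $\sqrt{2 \rho \mathfrak{z}} > g_0$, then select $\lambda = g_0$. For $r \le r^*$
\[
\log \mathbb{P}\bigl\{ r\, u(r) - \rho r^2/2 \ge \mathfrak{z} \bigr\} = \log \mathbb{P}\bigl\{ u(r) \ge \mathfrak{z}/r + \rho r/2 \bigr\}
\le -\lambda (\mathfrak{z}/r + \rho r/2) + \lambda^2/2 + Q
\le -\lambda \sqrt{\rho \mathfrak{z}}\, (x + x^{-1} - 2)/2 - \lambda \sqrt{\rho \mathfrak{z}} + \lambda^2/2 + Q,
\]
where $x = \sqrt{\rho/\mathfrak{z}}\; r$. Arguing in the same way as above, we obtain
\[
\mathbb{P}\Bigl( \sup_{r \le r^*} \bigl\{ r\, u(r) - \rho r^2/2 \bigr\} > \mathfrak{z} \Bigr) \le 4 \mathfrak{z} \exp\bigl( -\lambda \sqrt{\rho(\mathfrak{z} - 1)} + \lambda^2/2 + Q \bigr)
\]
yielding (B.10). □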
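The constant $4\mathfrak{z}$ in (B.9) rests on the numerical fact that $\int_0^\infty \exp\{-2(x + x^{-1} - 2)\}\, x\, dx \le 2$, together with the monotonicity of the integrand in $b$. A minimal sketch for checking this, assuming a plain midpoint rule on a truncated range (the function name `slicing_integral`, the cutoff `upper` and the grid size `n` are illustrative choices, not part of the paper):

```python
import math


def slicing_integral(b: float, upper: float = 60.0, n: int = 200_000) -> float:
    """Midpoint-rule value of the integral of exp(-b (x + 1/x - 2)) * x over (0, upper)."""
    h = upper / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        # x + 1/x - 2 >= 0 for x > 0, so the exponent is never positive.
        total += x * math.exp(-b * (x + 1.0 / x - 2.0))
    return total * h


if __name__ == "__main__":
    for b in (2.0, 3.0, 5.0):
        print(b, slicing_integral(b))
```

The value at $b = 2$ stays below $2$, and larger $b$ only decreases the integral, which is why the condition $\rho(\mathfrak{z} - 1) \ge 2$ suffices.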

This result can be used for describing the concentration bound for the maximum of $(3\nu_0)^{-1}\, \mathcal{U}(\upsilon,\upsilon_0) - \rho\, d^2(\upsilon,\upsilon_0)/2$. Namely, it suffices to find $\mathfrak{z}$ ensuring the prescribed deviation probability. We state the result for a special case with $\rho = 1$ and $g_0 \ge 3$ which simplifies the notation.

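Corollary B.5. Under the conditions of Theorem B.4, for any $x \ge 0$ with $x + Q \ge 4$
\[
\mathbb{P}\Bigl( \sup_{\upsilon \in \mathcal{B}_{r^*}(\upsilon_0)} \Bigl\{ \frac{1}{3 \nu_0}\, \mathcal{U}(\upsilon,\upsilon_0) - \frac{1}{2}\, d^2(\upsilon,\upsilon_0) \Bigr\} > \mathfrak{z}_0(x,Q) \Bigr) \le \exp(-x),
\]
where with $g_0 = \nu_0 g \ge 2$
\[
\mathfrak{z}_0(x,Q) :=
\begin{cases}
\bigl( 1 + \sqrt{x+Q} \bigr)^2 & \text{if } 1 + \sqrt{x+Q} \le g_0, \\
1 + \bigl\{ 2 g_0^{-1}(x+Q) + g_0 \bigr\}^2 & \text{otherwise.}
\end{cases}
\tag{B.12}
\]

Proof. First consider the case $1 + \sqrt{x+Q} \le g_0$. In view of (B.9), it suffices to check that $\mathfrak{z} = (1 + \sqrt{x+Q})^2$ ensures
\[
\mathfrak{z} - 1 - \log(4 \mathfrak{z}) - Q \ge x.
\]
This follows from the inequality
\[
(1+y)^2 - 1 - 2 \log(2 + 2y) \ge y^2
\]
with $y = \sqrt{x+Q} \ge 2$.

If $1 + \sqrt{x+Q} > g_0$, define $\mathfrak{z} = 1 + y^2$ with $y = 2 g_0^{-1}(x+Q) + g_0$. Then
\[
g_0 \sqrt{\mathfrak{z} - 1} - \log(4 \mathfrak{z}) - g_0^2/2 - Q - x = g_0 y/2 - \log\{ 4(1 + y^2) \} \ge 0
\]
because $g_0 \ge 3$ and $3y/2 - \log(1 + y^2) \ge \log(4)$ for $y \ge 2$. □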
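For concreteness, the level $\mathfrak{z}_0(x,Q)$ and the two exponents it is checked against can be evaluated directly. The sketch below assumes the reading of (B.12), (B.9) and (B.10) with $\rho = 1$ used in the proof above; the function names are illustrative, and only the exponent inequalities are checked, not the admissibility conditions of Theorem B.4.

```python
import math


def z0_bound(x: float, q: float, g0: float) -> float:
    """Deviation level z_0(x, Q) as read from (B.12), stated for rho = 1."""
    root = math.sqrt(x + q)
    if 1.0 + root <= g0:
        return (1.0 + root) ** 2
    y = 2.0 * (x + q) / g0 + g0
    return 1.0 + y ** 2


def log_prob_bound(x: float, q: float, g0: float) -> float:
    """Right-hand side of (B.9) or (B.10) at z = z_0(x, Q) with rho = 1."""
    z = z0_bound(x, q, g0)
    if 1.0 + math.sqrt(x + q) <= g0:
        # First branch: exponent of (B.9).
        return -(z - 1.0) + math.log(4.0 * z) + q
    # Second branch: exponent of (B.10).
    return -g0 * math.sqrt(z - 1.0) + g0 ** 2 / 2.0 + math.log(4.0 * z) + q


if __name__ == "__main__":
    # Hypothetical values with x + Q >= 4 and g0 >= 3, as required in the proof.
    for x, q, g0 in [(1.0, 3.0, 10.0), (2.0, 3.0, 3.0), (20.0, 30.0, 4.0)]:
        bound = log_prob_bound(x, q, g0)
        print(x, q, g0, z0_bound(x, q, g0), bound, bound <= -x)
```

In each case the exponent is at most $-x$, in line with the statement of the corollary.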

If $g \gg \sqrt{Q}$ and $x$ is not too big then $\mathfrak{z}_0(x,Q)$ is of order $x + Q$. So, the main message of this result is that with a high probability the maximum of $(3\nu_0)^{-1}\, \mathcal{U}(\upsilon,\upsilon_0) - d^2(\upsilon,\upsilon_0)/2$ does not significantly exceed the level $Q$.

B.4 A multiscale upper function and hitting probability 

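The result of the previous section can be explained as a local upper function for the process $\mathcal{U}(\cdot)$. Indeed, in a vicinity $\mathcal{B}_{r^*}(\upsilon_0)$ of the central point $\upsilon_0$, it holds $(3\nu_0)^{-1}\, \mathcal{U}(\upsilon,\upsilon_0) \le d^2(\upsilon,\upsilon_0)/2 + \mathfrak{z}$ except on a set whose probability is exponentially small in $\mathfrak{z}$. This section aims at extending this local result to the whole set $\Upsilon$ using multiscaling arguments. To simplify the notation, assume that $\mathcal{U}(\upsilon_0) = 0$. Then $\mathcal{U}(\upsilon,\upsilon_0) = \mathcal{U}(\upsilon)$. We say that $u(\mu,\upsilon)$ is a multiscale upper function for $\mu\, \mathcal{U}(\cdot)$ on a subset $\Upsilon^\circ$ of $\Upsilon$ if
\[
\mathbb{P}\Bigl( \sup_{\mu \in \mathbb{M}} \sup_{\upsilon \in \Upsilon^\circ} \bigl\{ \mu\, \mathcal{U}(\upsilon) - u(\mu,\upsilon) \bigr\} \ge \mathfrak{z}(x) \Bigr) \le e^{-x}, \tag{B.13}
\]
for some fixed function $\mathfrak{z}(x)$. An upper function can be used for describing the concentration sets of the point of maximum $\tilde{\upsilon} = \operatorname{argmax}_{\upsilon \in \Upsilon^\circ} \mathcal{U}(\upsilon)$; see Theorem B.8 below.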

The desired global bound requires an extension of the local exponential moment condition $(\mathcal{E}d)$. Below we suppose that the pseudo-metric $d(\upsilon,\upsilon')$ is given on the whole set $\Upsilon$. For each $r$ this metric defines the ball $\Upsilon_0(r)$ by the constraint $d(\upsilon,\upsilon_0) \le r$. Below the condition $(\mathcal{E}d)$ is assumed to be fulfilled for any $r$; however, the constant $g$ may depend on the radius $r$.

$(\mathcal{E}r)$ For any $r$, there exists $g(r) > 0$ such that (B.1) holds for all $\upsilon, \upsilon' \in \Upsilon_0(r)$ and all $\lambda \le g(r)$.






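Condition $(\mathcal{E}r)$ implies a similar condition for the scaled process $\mu\, \mathcal{U}(\upsilon)$ with $g = \mu^{-1} g(r)$ and $d(\upsilon,\upsilon')$ replaced by $\mu\, d(\upsilon,\upsilon')$. Corollary B.5 implies for any $x$ with $1 + \sqrt{x+Q} \le g_0(r) := \nu_0 g(r)/\mu$
\[
\mathbb{P}\Bigl( \sup_{\upsilon \in \mathcal{B}_r(\upsilon_0)} \Bigl\{ \frac{\mu}{3 \nu_0}\, \mathcal{U}(\upsilon) - \frac{1}{2} \mu^2 r^2 \Bigr\} \ge \mathfrak{z}_0(x,Q) \Bigr) \le \exp(-x). \tag{B.14}
\]
Let now a finite or separable set $\mathbb{M}$ and a function $t(\mu) \ge 1$ be fixed such that
\[
\sum_{\mu \in \mathbb{M}} e^{-t(\mu)} \le 2. \tag{B.15}
\]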

One possible choice of the set $\mathbb{M}$ and the function $t(\mu)$ is to take a geometric sequence $\mu_k = \mu_0 2^{-k}$ with any fixed $\mu_0$ and define $t(\mu_k) = k = -\log_2(\mu_k/\mu_0)$ for $k \ge 0$.
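The geometric choice can be checked against (B.15) numerically. The following is a minimal sketch, assuming the grid $\mu_k = \mu_0 2^{-k}$ with $t(\mu_k) = k$ for $k \ge 0$; the function names, the value of `mu0` and the truncation level `k_max` are illustrative and not part of the paper.

```python
import math


def geometric_grid(mu0: float, k_max: int):
    """Pairs (mu_k, t(mu_k)) with mu_k = mu0 * 2**(-k) and t(mu_k) = k, k = 0..k_max."""
    return [(mu0 * 2.0 ** (-k), float(k)) for k in range(k_max + 1)]


def penalty_sum(grid) -> float:
    """Sum of exp(-t(mu)) over the grid, i.e. the left-hand side of (B.15)."""
    return sum(math.exp(-t) for _, t in grid)


grid = geometric_grid(mu0=1.0, k_max=50)
total = penalty_sum(grid)
# Value of the full geometric series sum_{k >= 0} e^{-k}.
limit = 1.0 / (1.0 - math.exp(-1.0))

if __name__ == "__main__":
    print(total, limit, total <= 2.0)
```

The sum is a geometric series with ratio $e^{-1}$, so it stays below $2$ however many scales are included.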

Putting together the bounds (B.14) for different $\mu \in \mathbb{M}$ yields the following result.

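Theorem B.6. Suppose $(\mathcal{E}r)$ and (B.15). Then for any $x \ge 2$, there exists a random set $A(x)$ of a total probability at least $1 - 2e^{-x}$, such that it holds on $A(x)$ for any $r$
\[
\sup_{\upsilon \in \mathcal{B}_r(\upsilon_0)} \sup_{\mu \in \mathbb{M}(r,x)} \Bigl\{ \frac{\mu}{3 \nu_0}\, \mathcal{U}(\upsilon) - \frac{1}{2} \mu^2 r^2 - \bigl\{ 1 + \sqrt{x + Q + t(\mu)} \bigr\}^2 \Bigr\} < 0,
\]
where
\[
\mathbb{M}(r,x) := \bigl\{ \mu \in \mathbb{M} : 1 + \sqrt{x + Q + t(\mu)} \le \nu_0 g(r)/\mu \bigr\}.
\]
Proof. For each $\mu \in \mathbb{M}(r,x)$, Corollary B.5 implies
\[
\mathbb{P}\Bigl( \sup_{\upsilon \in \mathcal{B}_r(\upsilon_0)} \Bigl\{ \frac{\mu}{3 \nu_0}\, \mathcal{U}(\upsilon) - \frac{1}{2} \mu^2 r^2 \Bigr\} \ge \bigl\{ 1 + \sqrt{x + Q + t(\mu)} \bigr\}^2 \Bigr) \le e^{-x - t(\mu)}.
\]
The desired assertion is obtained by summing over $\mu \in \mathbb{M}$ due to (B.15). □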
Moreover, the inequality $x + Q \ge 2.5$ yields
\[
\bigl\{ 1 + \sqrt{x + Q + t(\mu)} \bigr\}^2 \le 2 \{ x + Q + t(\mu) \}.
\]
This allows to take in (B.13) $u(\mu,\upsilon) = 3 \nu_0 \{ \mu^2 r^2/2 + 2 t(\mu) \}$ and $\mathfrak{z}(x) = 2(x + Q)$.

Corollary B.7. Suppose $(\mathcal{E}r)$ and (B.15). Then for any $x$ with $x + Q \ge 2.5$, there exists a random set $\Omega(x)$ of a total probability at least $1 - 2e^{-x}$, such that it holds on $\Omega(x)$ for any $r$
\[
\sup_{\upsilon \in \mathcal{B}_r(\upsilon_0)} \sup_{\mu \in \mathbb{M}(r,x)} \Bigl\{ \frac{\mu}{3 \nu_0}\, \mathcal{U}(\upsilon) - \frac{1}{2} \mu^2 r^2 - 2 t(\mu) \Bigr\} < 2(x + Q).
\]

Now we briefly discuss the hitting problem. Let $M(\upsilon)$ be a deterministic boundary function. We aim at bounding the probability that a process $\mathcal{U}(\upsilon)$ hits this boundary on the set $\Upsilon$. This precisely means the probability that $\sup_{\upsilon \in \Upsilon} \{ \mathcal{U}(\upsilon) - M(\upsilon) \} \ge 0$. An important observation here is that multiplication by any positive factor $\mu$ does not change the relation. This allows to apply the multiscale result from Theorem B.6. For any fixed $x$ and any $\upsilon \in \mathcal{B}_r(\upsilon_0)$, define
\[
M^*(\upsilon) := \sup_{\mu \in \mathbb{M}(r,x)} \Bigl\{ \frac{\mu}{3 \nu_0}\, M(\upsilon) - \frac{1}{2} \mu^2 r^2 - 2 t(\mu) \Bigr\}.
\]

Theorem B.8. Suppose $(\mathcal{E}r)$, (B.15), and $x + Q \ge 2.5$. Let, given $x$, it hold
\[
M^*(\upsilon) \ge 2(x + Q), \qquad \upsilon \in \Upsilon. \tag{B.16}
\]
Then
\[
\mathbb{P}\Bigl( \sup_{\upsilon \in \Upsilon} \bigl\{ \mathcal{U}(\upsilon) - M(\upsilon) \bigr\} \ge 0 \Bigr) \le 2 e^{-x}.
\]

Maximizing the expression $(3\nu_0)^{-1} \mu\, M(\upsilon) - \mu^2 r^2/2$ suggests the choice $\mu = M(\upsilon)/(3 \nu_0 r^2)$ yielding $M^*(\upsilon) \ge M^2(\upsilon)/(18 \nu_0^2 r^2) - 2 t(\mu)$. In particular, the condition (B.16) requires that $M(\upsilon)$ grows with $r$ a bit faster than a linear function.



B.5 Finite-dimensional smooth case 

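Here we discuss the special case when $\Upsilon$ is an open subset in $\mathbb{R}^p$, the stochastic process $\mathcal{U}(\upsilon)$ is absolutely continuous and its gradient $\nabla \mathcal{U}(\upsilon) := d\, \mathcal{U}(\upsilon)/d\upsilon$ has bounded exponential moments.

$(\mathcal{E}D)$ There exist $g > 0$, $\nu_0 \ge 1$, and for each $\upsilon \in \Upsilon$, a symmetric non-negative matrix $H(\upsilon)$ such that for any $\lambda \le g$ and any unit vector $\gamma \in \mathbb{R}^p$, it holds
\[
\log \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top \nabla \mathcal{U}(\upsilon)}{\| H(\upsilon) \gamma \|} \Bigr\} \le \frac{\nu_0^2 \lambda^2}{2}.
\]

A natural candidate for $H^2(\upsilon)$ is the covariance matrix $\operatorname{Var}(\nabla \mathcal{U}(\upsilon))$ provided that this matrix is well posed. Then the constant $\nu_0$ can be taken close to one by reducing the value $g$; see Lemma B.17 below.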

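In what follows we fix a subset $\Upsilon^\circ$ of $\Upsilon$ and establish a bound for the maximum of the process $\mathcal{U}(\upsilon,\upsilon^\circ) = \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ)$ on $\Upsilon^\circ$ for a fixed point $\upsilon^\circ$. We will assume existence of a dominating matrix $H^* = H^*(\Upsilon^\circ)$ such that $H(\upsilon) \preceq H^*$ for all $\upsilon \in \Upsilon^\circ$. We also assume that $\pi$ is the Lebesgue measure on $\Upsilon$. First we show that the differentiability condition $(\mathcal{E}D)$ implies $(\mathcal{E}d)$.

Lemma B.9. Assume that $(\mathcal{E}D)$ holds with some $g$ and $H(\upsilon) \preceq H^*$ for $\upsilon \in \Upsilon^\circ$. Consider any $\upsilon, \upsilon^\circ \in \Upsilon^\circ$. Then it holds for $|\lambda| \le g$
\[
\log \mathbb{E} \exp\Bigl\{ \lambda \frac{\mathcal{U}(\upsilon,\upsilon^\circ)}{\| H^*(\upsilon - \upsilon^\circ) \|} \Bigr\} \le \frac{\nu_0^2 \lambda^2}{2}.
\]

Proof. Denote $\delta = \| \upsilon - \upsilon^\circ \|$, $\gamma = (\upsilon - \upsilon^\circ)/\delta$. Then
\[
\mathcal{U}(\upsilon,\upsilon^\circ) = \delta\, \gamma^\top \int_0^1 \nabla \mathcal{U}(\upsilon^\circ + t \delta \gamma)\, dt
\]
and $\| H^*(\upsilon - \upsilon^\circ) \| = \delta \| H^* \gamma \|$. Now the Hölder inequality and $(\mathcal{E}D)$ yield
\[
\mathbb{E} \exp\Bigl\{ \lambda \frac{\mathcal{U}(\upsilon,\upsilon^\circ)}{\| H^*(\upsilon - \upsilon^\circ) \|} - \frac{\nu_0^2 \lambda^2}{2} \Bigr\}
= \mathbb{E} \exp\Bigl\{ \int_0^1 \Bigl[ \lambda \frac{\gamma^\top \nabla \mathcal{U}(\upsilon^\circ + t \delta \gamma)}{\| H^* \gamma \|} - \frac{\nu_0^2 \lambda^2}{2} \Bigr] dt \Bigr\}
\le \int_0^1 \mathbb{E} \exp\Bigl\{ \lambda \frac{\gamma^\top \nabla \mathcal{U}(\upsilon^\circ + t \delta \gamma)}{\| H^* \gamma \|} - \frac{\nu_0^2 \lambda^2}{2} \Bigr\}\, dt \le 1
\]
as required. □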

The result of Lemma B.9 enables us to define $d(\upsilon,\upsilon') = \| H^*(\upsilon - \upsilon') \|$ so that the corresponding ball coincides with the ellipsoid $B(r,\upsilon^\circ)$. Now we bound the value $Q(\Upsilon^\circ)$ for $\Upsilon^\circ = B(r_0,\upsilon^\circ)$.

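Lemma B.10. Let $\Upsilon^\circ = B(r_0,\upsilon^\circ)$. Under the conditions of Lemma B.9, it holds $Q(\Upsilon^\circ) \le c_1 p$, where $c_1 = 2$ for $p \ge 2$, and $c_1 = 2.7$ for $p = 1$.

Proof. The set $\Upsilon^\circ$ coincides with the ellipsoid $B(r_0,\upsilon^\circ)$ while the $d$-ball $\mathcal{B}_k(\upsilon)$ coincides with the ellipsoid $B(r_k,\upsilon)$ for each $k \ge 2$. By change of variables, the study can be reduced to the case with $\upsilon^\circ = 0$, $H^* = I_p$, $r_0 = 1$, so that $B(r,\upsilon)$ is the usual Euclidean ball in $\mathbb{R}^p$ of radius $r$. It is obvious that the measure of the overlap of two balls $B(1,0)$ and $B(2^{-k+1},\upsilon)$ for $\| \upsilon \| \le 1$ is minimized when $\| \upsilon \| = 1$, and this value is the same for all such $\upsilon$.

Now we use the following observation. Fix $\upsilon^\sharp$ with $\| \upsilon^\sharp \| = 1$. Let $r \le 1$, $\upsilon^\flat = (1 - r^2/2)\, \upsilon^\sharp$ and $r^\flat = r - r^2/2$. If $\upsilon \in B(r^\flat,\upsilon^\flat)$, then $\upsilon \in B(r,\upsilon^\sharp)$ because
\[
\| \upsilon^\sharp - \upsilon \| \le \| \upsilon^\sharp - \upsilon^\flat \| + \| \upsilon - \upsilon^\flat \| \le r^2/2 + r - r^2/2 \le r.
\]
Moreover, for each $\upsilon \in B(r^\flat,\upsilon^\flat)$, it holds with $u = \upsilon - \upsilon^\flat$
\[
\| \upsilon \|^2 = \| \upsilon^\flat \|^2 + \| u \|^2 + 2 u^\top \upsilon^\flat \le (1 - r^2/2)^2 + |r^\flat|^2 + 2 u^\top \upsilon^\flat \le 1 + 2 u^\top \upsilon^\flat.
\]
This means that either $\upsilon = \upsilon^\flat + u$ or $\upsilon^\flat - u$ belongs to the ball $B(r_0,\upsilon^\circ)$ and thus $\pi\bigl( B(1,0) \cap B(r,\upsilon^\sharp) \bigr) \ge \pi\bigl( B(r^\flat,\upsilon^\flat) \bigr)/2$. We conclude that
\[
\frac{\pi\bigl( B(1,0) \bigr)}{\pi\bigl( B(1,0) \cap B(r,\upsilon^\sharp) \bigr)} \le \frac{2 \pi\bigl( B(1,0) \bigr)}{\pi\bigl( B(r^\flat,0) \bigr)} = 2 (r - r^2/2)^{-p}.
\]
This implies for $k \ge 1$ and $r_k = 2^{-k+1}$ that $2 M_{k+1} \le 2^{2+kp} (1 - 2^{-k-1})^{-p}$. The quantity $Q(\Upsilon^\circ)$ can now be evaluated, using $\sum_{k \ge 1} 2^{-k} = 1$ and $\sum_{k \ge 1} k 2^{-k} = 2$, as
\[
Q(\Upsilon^\circ) \le \frac{1}{3} \log(2^{2+p}) + \frac{2}{3} \sum_{k=1}^{\infty} 2^{-k} \log(2^{2+kp}) - \frac{2p}{3} \sum_{k=1}^{\infty} 2^{-k} \log(1 - 2^{-k-1})
= \Bigl( 2 + \frac{5p}{3} \Bigr) \log 2 - \frac{2p}{3} \sum_{k=1}^{\infty} 2^{-k} \log(1 - 2^{-k-1})
\le c_1 p,
\]
where $c_1 = 2$ for $p \ge 2$, and $c_1 = 2.7$ for $p = 1$, and the result follows. □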
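The constants $c_1$ can be checked numerically. The sketch below evaluates the series bound for $Q(\Upsilon^\circ)$ from the proof, read here as $\frac{1}{3}\log(2^{2+p}) + \frac{2}{3}\sum_{k\ge1} 2^{-k}\log(2^{2+kp}) - \frac{2p}{3}\sum_{k\ge1} 2^{-k}\log(1-2^{-k-1})$, and compares it with $c_1 p$. This reading of the garbled display is an assumption, and the function names and the truncation level `terms` are illustrative.

```python
import math


def entropy_bound(p: int, terms: int = 200) -> float:
    """Series bound on Q for the unit Euclidean ball in R^p (truncated after `terms` terms)."""
    log2 = math.log(2.0)
    first = (2 + p) * log2 / 3.0
    second = (2.0 / 3.0) * sum(
        2.0 ** (-k) * (2 + k * p) * log2 for k in range(1, terms + 1)
    )
    third = -(2.0 * p / 3.0) * sum(
        2.0 ** (-k) * math.log(1.0 - 2.0 ** (-k - 1)) for k in range(1, terms + 1)
    )
    return first + second + third


def c1(p: int) -> float:
    """Constant from Lemma B.10: 2.7 in dimension one, 2 otherwise."""
    return 2.7 if p == 1 else 2.0


if __name__ == "__main__":
    for p in (1, 2, 3, 10, 100):
        print(p, entropy_bound(p), c1(p) * p)
```

Under this reading the bound is affine in $p$ with slope below $2$, which explains why the constant has to be enlarged only for $p = 1$.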



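Now we specify the local bounds of Theorem B.1 and the central result of Corollary B.5 to the smooth case.

Theorem B.11. Suppose $(\mathcal{E}D)$. For any $\lambda \le \nu_0 g$, $r_0 > 0$, and $\upsilon^\circ \in \Upsilon$
\[
\log \mathbb{E} \exp\Bigl\{ \frac{\lambda}{3 \nu_0 r_0} \sup_{\upsilon \in B(r_0,\upsilon^\circ)} \bigl| \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ) \bigr| \Bigr\} \le \lambda^2/2 + Q,
\]
where $Q = c_1 p$.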

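We consider the local sets of the elliptic form $\Upsilon_0(r) := \{ \upsilon : \| H_0(\upsilon - \upsilon_0) \| \le r \}$, where $H_0$ dominates $H(\upsilon)$ on this set: $H(\upsilon) \preceq H_0$.

Theorem B.12. Let $(\mathcal{E}D)$ hold with some $g$ and a matrix $H(\upsilon)$. Suppose that $H(\upsilon) \preceq H_0$ for all $\upsilon \in \Upsilon_0(r)$. Then
\[
\mathbb{P}\Bigl( \sup_{\upsilon \in \Upsilon_0(r)} \Bigl\{ \frac{1}{3 \nu_0}\, \mathcal{U}(\upsilon,\upsilon_0) - \frac{1}{2} \| H_0(\upsilon - \upsilon_0) \|^2 \Bigr\} \ge \mathfrak{z}_0(x,p) \Bigr) \le \exp(-x), \tag{B.17}
\]
where $\mathfrak{z}_0(x,p)$ coincides with $\mathfrak{z}_0(x,Q)$ from (B.12) for $Q = c_1 p$.

Remark B.2. An important feature of the established result is that the bound in the right-hand side of (B.17) does not depend on the value $r$ describing the radius of the local vicinity around the central point $\upsilon_0$. In the ideal case one would apply this result with $r = \infty$ provided that the condition $H(\upsilon) \preceq H_0$ is fulfilled uniformly over $\Upsilon$.

Proof. Lemmas B.9 and B.10 imply $(\mathcal{E}d)$ with $d(\upsilon,\upsilon_0) = \| H_0(\upsilon - \upsilon_0) \|$, so that the drift term $d^2(\upsilon,\upsilon_0)/2$ equals $\| H_0(\upsilon - \upsilon_0) \|^2/2$, and $Q \le c_1 p$. Now the result follows from Corollary B.5. □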

The global result of Theorem B.6 applies without changes to the smooth case.

B.6 Roughness constraints for dimension reduction

The local bounds of Theorems B.1 and B.4 can be extended in several directions. Here we briefly discuss one extension related to the use of a smoothness condition on the parameter $\upsilon$. Let $t(\upsilon)$ be a non-negative penalty function on $\Upsilon$. A particular example of such a penalty function is the roughness penalty $t(\upsilon) = \| G \upsilon \|^2$ for a given matrix $G$. Let $t_0 \ge 1$ be fixed. Redefine the sets $\mathcal{B}_r(\upsilon^\circ)$ by the constraint $t(\upsilon) \le t_0$:
\[
\mathcal{B}_r(\upsilon^\circ) := \bigl\{ \upsilon \in \Upsilon : d(\upsilon,\upsilon^\circ) \le r; \; t(\upsilon) \le t_0 \bigr\},
\]
and consider $\Upsilon^\circ = \mathcal{B}_{r_0}(\upsilon^\circ)$ for a fixed central point $\upsilon^\circ$ and the radius $r_0$. One can easily check that the results of Theorems B.1 and B.4 and their corollaries extend to this situation without any change. The only difference is in the definition of the values $Q(\Upsilon^\circ)$ and $Q$. Each value $Q(\Upsilon^\circ)$ is defined via the quantities $\pi_k(\upsilon) = \pi\bigl( \mathcal{B}_{r_k}(\upsilon) \bigr)$ which obviously change when each ball $\mathcal{B}_r(\upsilon)$ is redefined. Examples below show that the use of the penalization can substantially reduce the value $Q$.

Now we specify the results to the case of a smooth process $\mathcal{U}$ given on a local ball $\Upsilon^\circ = \mathcal{B}_{r_0}(\upsilon^\circ)$ defined by the condition $\{ \| H_0(\upsilon - \upsilon^\circ) \| \le r_0 \}$ and a smoothness constraint $\| G \upsilon \|^2 \le t_0 = r_0^2$. The local sets $\mathcal{B}_r(\upsilon)$ are of the form:
\[
\mathcal{B}_r(\upsilon) := \bigl\{ \upsilon' : \| H_0(\upsilon - \upsilon') \| \le r, \; \| G \upsilon' \| \le r_0 \bigr\}. \tag{B.18}
\]

The effective dimension $p_e = p_e(S)$ can be defined as the dimension of the subspace on which $H_0 \ge G$. The formal definition uses the spectral decomposition of the matrix $S = H_0^{-1} G^2 H_0^{-1}$. Let $g_1 \le g_2 \le \dots \le g_p$ be the eigenvalues of $S$. Define $p_e(S)$ as the largest index $j$ for which $g_j \le 1$:
\[
p_e(S) := \max\{ j \ge 1 : g_j \le 1 \}. \tag{B.19}
\]

In the non-penalized case, the entropy term $Q$ is proportional to the dimension $p$. The roughness penalty reduces $p$ to the effective dimension $p_e(S)$, which can be much smaller than $p$ depending on the relation between the matrices $H_0$ and $S$. More precisely, if the eigenvalues $g_j$ of $S$ grow sufficiently fast, the entropy calculus effectively reduces to the coordinates with $g_j \le 1$.

Lemma B.13. Let $g_1 = 0$. For each $r_0 \ge 1$, it holds
\[
Q(\Upsilon_0(r_0)) \le c_1\, p_s \tag{B.20}
\]
with $p_s = p_s(S)$ defined by
\[
p_s(S) := p_e(S) + \sum_{j=1}^{p} g_j^{-1} \log_+(g_j). \tag{B.21}
\]
If the sum $\sum_{j \ge 1} g_j^{-1} \log_+(g_j)$ is bounded by a fixed constant, then the value $p_s$ is close to the effective dimension $p_e(S)$ from (B.19).
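To get a feeling for the gap between $p$, $p_e$ and $p_s$, the two dimensions can be computed from a spectrum. The sketch below assumes the reading $p_s(S) = p_e(S) + \sum_j g_j^{-1} \log_+(g_j)$ of (B.21) with $\log_+(g) = \max\{\log g, 0\}$ (natural logarithm); the spectrum $g_j = (j-1)^2$ is purely hypothetical and the function names are illustrative.

```python
import math


def effective_dimension(eigs) -> int:
    """p_e(S) as in (B.19): number of eigenvalues g_j <= 1 (for sorted g_1 <= ... <= g_p)."""
    return sum(1 for g in eigs if g <= 1.0)


def smooth_dimension(eigs) -> float:
    """p_s(S) under the assumed reading of (B.21); terms with g_j <= 1 contribute zero."""
    tail = sum(math.log(g) / g for g in eigs if g > 1.0)
    return effective_dimension(eigs) + tail


# Hypothetical spectrum in dimension p = 1000 with g_1 = 0 and quadratic growth.
eigs = [float((j - 1) ** 2) for j in range(1, 1001)]
p_e = effective_dimension(eigs)
p_s = smooth_dimension(eigs)

if __name__ == "__main__":
    print(len(eigs), p_e, p_s)
```

With this quadratically growing spectrum the entropy term is governed by a value $p_s$ below $4$ although the nominal dimension is $p = 1000$.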

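Proof. We follow the proof of Lemma B.10. By a change of variables one can reduce the problem to the case when $H_0$ is the identity matrix and $r_0 = 1$. Moreover, one can easily see that $\upsilon^\circ = 0$ is the hardest case. The case $\upsilon^\circ \ne 0$ can be considered similarly. By a further change the matrix $S = G^2$ can be represented in diagonal form: $S = \operatorname{diag}\{ g_1 \le \dots \le g_p \}$. Let $\upsilon^\sharp$ be any point with $\| \upsilon^\sharp \| \le 1$ and $\| G \upsilon^\sharp \| \le 1$, and $r \le 1$. Simple arguments show that the measure of the set $\mathcal{B}_r(\upsilon^\sharp)$ over all such vectors is minimized at $\upsilon^\sharp = (1, 0, \dots, 0)^\top$. Define $r^\flat = r - r^2/2$, $\upsilon^\flat = (1 - r^2/2)\, \upsilon^\sharp$. Fix any $\upsilon$ such that $u = \upsilon - \upsilon^\flat$ fulfills $\| u \| \le r^\flat$, $\| G u \| \le r$, and $u_1 \le 0$. As $g_1 = 0$, it holds
\[
\| G \upsilon \| = \| G u \| \le r \le 1,
\]
\[
\| \upsilon - \upsilon^\sharp \| \le \| \upsilon - \upsilon^\flat \| + \| \upsilon^\flat - \upsilon^\sharp \| \le r^\flat + r^2/2 = r,
\]
\[
\| \upsilon \|^2 = \| \upsilon^\flat + u \|^2 = \| \upsilon^\flat \|^2 + \| u \|^2 + 2 u^\top \upsilon^\flat \le (1 - r^2/2)^2 + |r^\flat|^2 \le 1.
\]
This yields that $\pi\bigl( \mathcal{B}_1(0) \cap \mathcal{B}_r(\upsilon^\sharp) \bigr) \ge \pi\bigl( \mathcal{B}_{r^\flat}(0) \bigr)/2$. Moreover, let the index $p_e(r)$ be defined as the largest $j$ with $g_j \le r$. Consider any $\upsilon \in \mathcal{B}_1(0)$ and construct another point $\upsilon(r)$ by multiplying with $r$ every element $\upsilon_j$ for $j \le p_e(r)$. The construction ensures that $\upsilon(r) \in \mathcal{B}_r(0)$. This implies
\[
\frac{\pi\bigl( \mathcal{B}_1(0) \bigr)}{\pi\bigl( \mathcal{B}_1(0) \cap \mathcal{B}_r(\upsilon^\sharp) \bigr)} \le \frac{2 \pi\bigl( \mathcal{B}_1(0) \bigr)}{\pi\bigl( \mathcal{B}_{r^\flat}(0) \bigr)} \le 2 (r - r^2/2)^{-p_e(r)}.
\]
Application of this bound for $k \ge 1$, $r_{k+1} = 2^{-k}$, and $p_k = p_e(r_{k+1})$ yields that $2 M_{k+1} \le 2^{2 + k p_k} (1 - 2^{-k-1})^{-p_k}$. The quantity $Q(\Upsilon^\circ)$ can now be evaluated as
\[
Q(\Upsilon^\circ) \le \frac{1}{3} \log(2^{2+p_e}) + \frac{2}{3} \sum_{k=1}^{\infty} 2^{-k} \log(2^{2 + k p_k}) - \frac{2}{3} \sum_{k=1}^{\infty} 2^{-k} p_k \log(1 - 2^{-k-1}).
\]
Further, for each $g \ge 1$, it holds with $k(g) := \min\{ k : g \le 2^k \}$
\[
s(g) := \sum_{k=1}^{\infty} 2^{-k+1} k\, 1(2^{-k} g \le 1) \le 2 k(g) 2^{-k(g)} \le 2 k(g)/g \le 2 g^{-1} \log_2(2g).
\]
Thus,
\[
\sum_{k=1}^{\infty} 2^{-k} k\, p_k \le \sum_{k=1}^{\infty} 2^{-k} k \sum_{j=1}^{p} 1(2^{-k} g_j \le 1) \le 2 \sum_{j=1}^{p} [\ldots]
\]
This easily implies the result (B.20); cf. the proof of Lemma B.10. □

The first result adjusts Theorem B.11 to the penalized case. The maximum of the process $\mathcal{U}$ is taken over a ball $\mathcal{B}_r(\upsilon)$ from (B.18) which is smaller than the similar ball in the non-penalized case. This explains the gain in the entropy term $Q$.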



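Theorem B.14. Let $(\mathcal{E}D)$ hold with some $g$ and a matrix $H(\upsilon) \preceq H_0$ for all $\upsilon \in \Upsilon$. Then for any $\lambda \le \nu_0 g$, $r_0 \ge 1$, and $\upsilon^\circ \in \Upsilon$
\[
\log \mathbb{E} \exp\Bigl\{ \frac{\lambda}{3 \nu_0 r_0} \sup_{\upsilon \in \mathcal{B}_{r_0}(\upsilon^\circ)} \bigl| \mathcal{U}(\upsilon) - \mathcal{U}(\upsilon^\circ) \bigr| \Bigr\} \le \lambda^2/2 + Q,
\]
where $\mathcal{B}_{r_0}(\upsilon^\circ)$ is given by (B.18), $Q = c_1 p_s$, and $p_s$ is the effective dimension from (B.21).

Proof. The result follows from Corollary B.5. It is only required to evaluate the local entropy $Q(\Upsilon_0(r_0))$. This is done in Lemma B.13. □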

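The magnitude of the process $\mathcal{U}$ over $\mathcal{B}_{r_0}(\upsilon^\circ)$ is of order $r_0$ and it grows with $r_0$. The use of the negative drift allows to establish a unified result.

Theorem B.15. Let $r_0 \ge 1$ be fixed and let $(\mathcal{E}D)$ hold with some $g$ and a matrix $H(\upsilon) \preceq H_0$ for all $\upsilon \in \mathcal{B}_{r_0}(\upsilon_0)$. Then
\[
\mathbb{P}\Bigl( \sup_{\upsilon \in \mathcal{B}_{r_0}(\upsilon_0)} \Bigl\{ \frac{1}{3 \nu_0}\, \mathcal{U}(\upsilon,\upsilon_0) - \frac{1}{2} \| H_0(\upsilon - \upsilon_0) \|^2 \Bigr\} \ge \mathfrak{z}_0(x,Q) \Bigr) \le \exp(-x),
\]
where $\mathfrak{z}_0(x,Q)$ is given by (B.12) with $Q = c_1 p_s$.

The result of Theorem B.6 for the non-penalized case applies without big changes to the penalized case.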

B.7 Auxiliary facts 

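Lemma B.16. For any r.v.'s $\xi_k$ and $\lambda_k \ge 0$ such that $\Lambda = \sum_k \lambda_k \le 1$
\[
\log \mathbb{E} \exp\Bigl( \sum_k \lambda_k \xi_k \Bigr) \le \sum_k \lambda_k \log \mathbb{E} e^{\xi_k}.
\]

Proof. Convexity of $e^x$ and concavity of $x^\Lambda$ imply
\[
\mathbb{E} \exp\Bigl\{ \sum_k \lambda_k \bigl( \xi_k - \log \mathbb{E} e^{\xi_k} \bigr) \Bigr\}
\le \Bigl[ \mathbb{E} \exp\Bigl\{ \frac{1}{\Lambda} \sum_k \lambda_k \bigl( \xi_k - \log \mathbb{E} e^{\xi_k} \bigr) \Bigr\} \Bigr]^{\Lambda}
\le \Bigl[ \frac{1}{\Lambda} \sum_k \lambda_k\, \mathbb{E} \exp\bigl( \xi_k - \log \mathbb{E} e^{\xi_k} \bigr) \Bigr]^{\Lambda} = 1.
\]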



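□

Lemma B.17. Let a r.v. $\xi$ fulfill $\mathbb{E} \xi = 0$, $\mathbb{E} \xi^2 = 1$ and $\mathbb{E} \exp(\lambda_1 |\xi|) = \varkappa < \infty$ for some $\lambda_1 > 0$. Then for any $\varrho < 1$ there is a constant $C_1$ depending on $\varkappa$, $\lambda_1$ and $\varrho$ only such that for $\lambda < \varrho \lambda_1$
\[
\log \mathbb{E} e^{\lambda \xi} \le C_1 \lambda^2/2.
\]
Moreover, there is a constant $\lambda_2 > 0$ such that for all $\lambda \le \lambda_2$
\[
\log \mathbb{E} e^{\lambda \xi} \ge \varrho \lambda^2/2.
\]

Proof. Define $h(x) = (\lambda - \lambda_1) x + m \log(x)$ for $m \ge 0$ and $\lambda < \lambda_1$. Simple algebra shows that
\[
\max_{x \ge 0} h(x) = -m + m \log \frac{m}{\lambda_1 - \lambda}.
\]
Therefore for any $x \ge 0$
\[
\lambda x + m \log(x) \le \lambda_1 x + \log \Bigl( \frac{m}{e(\lambda_1 - \lambda)} \Bigr)^m.
\]
This implies for all $\lambda < \lambda_1$
\[
\mathbb{E} |\xi|^m \exp(\lambda |\xi|) \le \Bigl( \frac{m}{e(\lambda_1 - \lambda)} \Bigr)^m\, \mathbb{E} \exp(\lambda_1 |\xi|).
\]
Suppose now that for some $\lambda_1 > 0$, it holds $\mathbb{E} \exp(\lambda_1 |\xi|) = \varkappa(\lambda_1) < \infty$. Then the function $h_0(\lambda) = \mathbb{E} \exp(\lambda \xi)$ fulfills $h_0(0) = 1$, $h_0'(0) = \mathbb{E} \xi = 0$, $h_0''(0) = 1$ and for $\lambda < \lambda_1$,
\[
h_0''(\lambda) = \mathbb{E} \xi^2 e^{\lambda \xi} \le \mathbb{E} \xi^2 e^{\lambda |\xi|} \le \frac{1}{(\lambda_1 - \lambda)^2}\, \mathbb{E} \exp(\lambda_1 |\xi|).
\]
This implies by the Taylor expansion for $\lambda < \varrho \lambda_1$ that
\[
h_0(\lambda) \le 1 + C_1 \lambda^2/2
\]
with $C_1 = \varkappa(\lambda_1)/\{ \lambda_1^2 (1 - \varrho)^2 \}$, and hence, $\log h_0(\lambda) \le C_1 \lambda^2/2$. □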
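The Hölder-type bound of Lemma B.16 holds for any joint law of the $\xi_k$, in particular for the empirical law of a finite sample, which allows a quick numerical sanity check. The following is a minimal sketch under that observation; the helper names, the sample size and the weights are illustrative and not part of the paper.

```python
import math
import random


def log_mean_exp(values) -> float:
    """Logarithm of the empirical mean of exp(v), computed in a numerically stable way."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def holder_gap(samples, weights) -> float:
    """Right-hand side minus left-hand side of Lemma B.16 under the empirical law.

    samples[k][i] is the i-th draw of xi_k; weights[k] is lambda_k >= 0 with sum <= 1.
    """
    n = len(samples[0])
    mixed = [sum(w * xi[i] for w, xi in zip(weights, samples)) for i in range(n)]
    lhs = log_mean_exp(mixed)
    rhs = sum(w * log_mean_exp(xi) for w, xi in zip(weights, samples))
    return rhs - lhs


rng = random.Random(0)
xi1 = [rng.gauss(0.0, 1.0) for _ in range(500)]
xi2 = [rng.expovariate(2.0) for _ in range(500)]
# A dependent third variable: the lemma does not require independence.
xi3 = [a - b for a, b in zip(xi1, xi2)]

if __name__ == "__main__":
    print(holder_gap([xi1, xi2, xi3], [0.2, 0.3, 0.1]))
    print(holder_gap([xi1, xi2, xi3], [1.0 / 3, 1.0 / 3, 1.0 / 3]))
```

The gap is non-negative in both cases, including the boundary case $\Lambda = 1$ in which the lemma reduces to the usual Hölder inequality.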